14
Linux Clusters Ins.tute: Monitoring Kyle Hutson – System Administrator for Kansas State University [email protected]

LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% [email protected]%

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

Linux  Clusters  Ins.tute:  Monitoring

Kyle  Hutson  –  System  Administrator  for  Kansas  State  University  [email protected]  

Page 2: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

Why  monitoring?

• How  should  we  get  no=fied?  • What  should  we  monitor?  • How  oAen  should  we  monitor?  •  Internal  vs  external  •  Informa=onal  vs  urgent  

2 May 20, 2015

Page 3: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

How  should  we  get  no.fied?

• Urgent:  •  Email  or  text  •  Define  this  carefully  

• Not-­‐so  urgent:  • Web  page  updates  

•  Especially  helpful  for  historical  data  •  Email  (filtered)  •  End-­‐user  support  requests  

3 May 20, 2015

Page 4: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

What  should  we  monitor?

•  External:  Basic  Connec=vity  •  Internal:  

•  The  urgent  •  Power  status  •  Scheduler/head  node  status  •  Cold-­‐aisle  temperatures  •  Storage  system  

4 May 20, 2015

Page 5: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

Lots  of  li?le  things

• Overall  cluster  health  •  Queue  size  •  Overall  network  usage  •  Number  of  responding  nodes  

•  Individual  node  health  •  Load  average  •  Memory  used  •  Network  bandwidth  •  CPU  usage  •  Temperature  

•  Storage  •  Capacity  •  Degraded  status  •  Connec=vity  

5 May 20, 2015

Page 6: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

Security

•  Securing  the  cluster  •  Security  status  updates  •  Any  failures  

•  sudo  reports  •  Network  login  failures  (e.g.  fail2ban)  •  crontab  failures  •  Logfile  errors  (customize  to  fit)  

6 May 20, 2015

Page 7: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

How  oBen?

•  You  will  quickly  get  a  feel  for  this  •  Too  much  info  is  o,en  worse  than  too  li3le  info  •  The  “urgent”  –  con=nually  •  The  “not-­‐so-­‐urgent”  –  anywhere  from  a  few  =mes  per  day  to  once  per  week  

•  There’s  nothing  wrong  with  trial  and  error  

7 May 20, 2015

Page 8: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

How  to  make  it  happen

• Nagios/NRPE  (Nagios  Remote  Plugin  Executor)  •  Generic  executable  that  runs  “plugins”  

•  Plugins  can  monitor  just  about  anything  you  can  think  of  monitoring  •  Even  works  with  Windows  •  Nagios  (hap://www.nagios.org/)  is  by  far  the  most  common  monitoring  system  

8 May 20, 2015

Page 9: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

How  to  make  it  happen

9 May 20, 2015

Page 10: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

How  to  make  it  happen

•  Icinga  (haps://www.icinga.org/)  •  Can  use  NRPE  •  (New)  version  2  has  its  own  client  •  Uses  database  backend  for  history  •  Mul=-­‐threaded  and  mul=homed  

10 May 20, 2015

Page 11: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

How  to  make  it  happen

11 May 20, 2015

Page 12: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

How  to  make  it  happen

• Ganglia  (hap://ganglia.sourceforge.net/)  -­‐  for  historical  and  resource  monitoring  

•  Ours  are  public  •  RRD  files  give  historical  data  (a.k.a.  “lots  of  preay  graphs”)  

12 May 20, 2015

Page 13: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

How  to  make  it  happen

13 May 20, 2015

Page 14: LinuxClustersInstute: Monitoring · LinuxClustersInstute: Monitoring Kyle%Hutson%–System%Administrator%for%Kansas%State%University% kylehutson@ksu.edu%

How  to  make  it  happen

• New  alterna=ve  to  Ganglia:  Graphite  (hap://graphite.wikidot.com/)  • Uses  “whisper”  instead  of  RRD  (smaller  files)  •  Scaling  is  beaer  than  Ganglia  • Dynamic  data  points  let  you  see  exactly  what  you  want  (with  some  prac=ce)  

•  S=ll  in  beta  

14 May 20, 2015