Overlay HPC Information
Christian Kniep, R&D HPC Engineer
2014-06-25
©Bull 2014
About Me

‣ 10y+ SysAdmin
‣ 8y+ SysOps
‣ B.Sc. (2008-2011)
‣ 6y+ DevOps
‣ 1y+ R&D

@CQnib - http://blog.qnib.org - https://github.com/ChristianKniep
My ISC History - Motivation
My ISC History - Description
HPC Software Stack (rough estimate)

Hardware:    HW-sensors/-errors
OS:          Kernel, Userland tools
Services:    Storage, Job Scheduler
MiddleWare:  MPI, ISV-libs
Software:    End user application
Excel:       KPI, SLA

[Diagram: each layer annotated with its stakeholders - HW, SysOps, SysOps/Mgmt, ISV, Power User/ISV, User, Mgmt]
HPC Software Stack (goal)

Hardware:    HW-sensors/-errors
OS:          Kernel, Userland tools
Services:    Storage, Job Scheduler
MiddleWare:  MPI, ISV-libs
Software:    End user application
Excel:       KPI, SLA

[Diagram: Log/Events and Perf metrics collected across all layers]
QNIBTerminal - History

‣ Cluster of n*1000+ IB nodes
  • Hard to debug
  • No useful tools in sight
  • Created my own (Graphite update in late 2013)
Achieved HPC Software Stack

Hardware:    IB-sensors/-errors
OS:          Kernel, Userland tools
Services:    Storage, Job Scheduler
MiddleWare:  MPI, ISV-libs
Software:    End user application
Excel:       KPI, SLA

[Diagram: Log/Events and Perf coverage achieved across the layers]
QNIBTerminal - Implementation
QNIBTerminal - blog.qnib.org

[Architecture diagram]
‣ Frontend: haproxy
‣ Services: dns (helixdns), etcd
‣ Log/Events: elk stack - elasticsearch, logstash, kibana
‣ Performance: carbon, graphite-web, graphite-api, grafana
‣ Compute: slurmctld, compute0..compute<N> running slurmd
DEMONSTRATION
Future Work
More Services

‣ Improve work-flow for log-events
‣ Nagios(-like) node is missing
‣ Cluster file system
‣ LDAP
‣ Additional dashboards
‣ Inventory
‣ Using InfiniBand for communication traffic
Graph Representation

‣ Graph inventory needed
  • Hierarchical view is not enough
  • A GraphDB seems to be a good fit

[Diagram: comp0/comp1/comp2 attached via InfiniBand switches ibsw0 and ibsw2 and Ethernet links eth1/eth10 to services such as ldap12 and lustre0]

RETURN node=comp* WHERE ROUTE TO lustre_service INCLUDES ibsw2
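The pseudo-query above could be answered with a plain breadth-first search over an inventory graph. A minimal Python sketch; the node names come from the diagram, but the links between them are assumed for illustration:

```python
from collections import deque

# Hypothetical topology: node names from the slide, links assumed.
EDGES = [
    ("comp0", "ibsw0"), ("comp1", "ibsw0"),
    ("comp2", "ibsw2"),
    ("ibsw0", "lustre0"), ("ibsw2", "lustre0"),
]

def build_graph(edges):
    """Undirected adjacency map from an edge list."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path(graph, src, dst):
    """BFS shortest path; returns the node list from src to dst, or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None

def nodes_routing_via(graph, prefix, target, via):
    """All nodes matching `prefix` whose route to `target` passes `via`."""
    return sorted(
        n for n in graph
        if n.startswith(prefix)
        and via in (shortest_path(graph, n, target) or [])
    )

print(nodes_routing_via(build_graph(EDGES), "comp", "lustre0", "ibsw2"))
# prints ['comp2']
```

In a real deployment the edge list would come from the inventory (LLDP, ibnetdiscover, etc.) and the query would run inside a GraphDB rather than in application code.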
Conclusion

‣ Training
  • New SysOps could start on a virtual cluster
  • 'Strangle' a node to replay an error
‣ Showcase
  • Show a customer his (to-be) software stack
  • Convince the SysOps team they have nothing to fear
‣ Complete toolchain could be automated
  • Testing
  • Verification
  • Q&A
‣ n*1000 containers through clustering
Conclusion

‣ n*100 containers are easy (50 on my laptop)
  • Running a 300 node cluster stack
Log AND Performance Management

‣ Metrics without logs are useless!
  • ...and the other way around
  • Overlapping is king