Overlay HPC Information
Christian Kniep, R&D HPC Engineer
2014-06-25
©Bull 2014
About Me

‣ 10y+ SysAdmin
‣ 8y+ SysOps
‣ B.Sc. (2008-2011)
‣ 6y+ DevOps
‣ 1y+ R&D

@CQnib - http://blog.qnib.org - https://github.com/ChristianKniep
My ISC History - Motivation
My ISC History - Description
HPC Software Stack (rough estimate)

Hardware:    HW-sensors/-errors
OS:          Kernel, Userland tools
Services:    Storage, Job Scheduler
MiddleWare:  MPI, ISV-libs
Software:    End user application
Excel:       KPI, SLA

[Diagram: each layer annotated with its stakeholders - HW, SysOps, SysOps/Mgmt, ISV, Power User/ISV, User, Mgmt]
HPC Software Stack (goal)

Hardware:    HW-sensors/-errors
OS:          Kernel, Userland tools
Services:    Storage, Job Scheduler
MiddleWare:  MPI, ISV-libs
Software:    End user application
Excel:       KPI, SLA

[Diagram: Log/Events and Perf metrics collected across all layers]
QNIBTerminal - History

‣ Cluster of n*1000+ IB nodes
  • Hard to debug
  • No useful tools in sight
  • Created my own (Graphite update in late 2013)
Achieved HPC Software Stack

Hardware:    IB-sensors/-errors
OS:          Kernel, Userland tools
Services:    Storage, Job Scheduler
MiddleWare:  MPI, ISV-libs
Software:    End user application
Excel:       KPI, SLA

[Diagram: Log/Events and Perf coverage achieved across the layers]
QNIBTerminal - Implementation
QNIBTerminal - blog.qnib.org

[Architecture diagram]
‣ Frontend: haproxy
‣ Services: dns (helixdns), etcd
‣ Log/Events: elk stack - elasticsearch, logstash, kibana
‣ Performance: carbon, graphite-web, graphite-api, grafana
‣ Compute: slurmctld, compute0..compute<N> running slurmd
DEMONSTRATION
Future Work
More Services

‣ Improve work-flow for log-events
‣ Nagios(-like) node is missing
‣ Cluster file system
‣ LDAP
‣ Additional dashboards
‣ Inventory
‣ Using InfiniBand for communication traffic
Graph Representation

‣ Graph inventory needed
  • Hierarchical view is not enough
  • A GraphDB seems to be a good fit

[Diagram: comp0/comp1/comp2 attached via InfiniBand switches ibsw0 and ibsw2 and Ethernet links eth1/eth10 to services such as ldap12 and lustre0]

RETURN node=comp* WHERE ROUTE TO lustre_service INCLUDES ibsw2
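The pseudo-query above could be answered with a plain breadth-first search over an inventory graph. A minimal Python sketch; the node names come from the diagram, but the links between them are assumed for illustration:

```python
from collections import deque

# Hypothetical topology: node names from the slide, links assumed.
EDGES = [
    ("comp0", "ibsw0"), ("comp1", "ibsw0"),
    ("comp2", "ibsw2"),
    ("ibsw0", "lustre0"), ("ibsw2", "lustre0"),
]

def build_graph(edges):
    """Undirected adjacency map from an edge list."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path(graph, src, dst):
    """BFS shortest path; returns the node list from src to dst, or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None

def nodes_routing_via(graph, prefix, target, via):
    """All nodes matching `prefix` whose route to `target` passes `via`."""
    return sorted(
        n for n in graph
        if n.startswith(prefix)
        and via in (shortest_path(graph, n, target) or [])
    )

print(nodes_routing_via(build_graph(EDGES), "comp", "lustre0", "ibsw2"))
# prints ['comp2']
```

In a real deployment the edge list would come from the inventory (LLDP, ibnetdiscover, etc.) and the query would run inside a GraphDB rather than in application code.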
Conclusion

‣ Training
  • New SysOps could start on a virtual cluster
  • 'Strangle' a node to replay an error
‣ Showcase
  • Show a customer his (to-be) software stack
  • Convince the SysOps team they have nothing to fear
‣ Complete toolchain could be automated
  • Testing
  • Verification
  • Q&A
‣ n*1000 containers through clustering
Conclusion

‣ n*100 containers are easy (50 on my laptop)
  • Running a 300 node cluster stack
Log AND Performance Management

‣ Metrics without logs are useless!
  • ...and the other way around
  • Overlapping is king