Overlay HPC Information

DESCRIPTION

In this presentation from ISC'14, Christian Kniep from Bull presents "Understand Your Cluster by Overlaying Multiple Information Layers". Kniep uses Docker technology in a novel way to ease the administration of InfiniBand networks. "Today's data center managers are burdened by a lack of aligned information across multiple layers. Work-flow events like 'job XYZ starts', aligned with performance metrics and events extracted from log facilities, are low-hanging fruit that is on the verge of becoming usable thanks to open-source software like Graphite, StatsD, logstash and the like. This talk aims to show off the benefits of merging multiple layers of information within an InfiniBand cluster, using use cases for system operation and management-level personnel. Mr. Kniep held two BoF sessions which described the lack of InfiniBand monitoring (ISC'12) and of generic HPC monitoring (ISC'13). This year's session aims to propose a way to fix it. To drill into the issue, Mr. Kniep uses his recently started project QNIBTerminal to spin up a complete cluster stack using LXC containers."

©Bull 2012

Overlay HPC Information

Christian Kniep

R&D HPC Engineer

2014-06-25

©Bull 2014

About Me

‣ 10y+ SysAdmin

‣ 8y+ SysOps

‣ B.Sc. (2008-2011)

‣ 6y+ DevOps

‣ 1y+ R&D

@CQnib - http://blog.qnib.org - https://github.com/ChristianKniep

My ISC History - Motivation

My ISC History - Description

HPC Software Stack (rough estimate)

Hardware: HW-sensors/-errors

OS: Kernel, Userland tools

MiddleWare: MPI, ISV-libs

Software: End user application

Excel: KPI, SLA

Services: Storage, Job Scheduler

Mgmt

SysOps

SysOps Mgmt

User

Power User/ISV

ISV Mgmt

HW

HPC Software Stack (goal)

Hardware: HW-sensors/-errors

OS: Kernel, Userland tools

MiddleWare: MPI, ISV-libs

Software: End user application

Excel: KPI, SLA

Services: Storage, Job Scheduler

Log/Events

Perf

QNIBTerminal - History

• No useful tools in sight

• Created my own

QNIB

‣ Cluster of n*1000+ IB nodes

• Hard to debug

• No useful tools in sight

• Created my own - Graphite-Update in late 2013

QNIB

‣ Cluster of n*1000+ IB nodes

• Hard to debug

Achieved HPC Software Stack

Hardware: IB-sensors/-errors

OS: Kernel, Userland tools

MiddleWare: MPI, ISV-libs

Software: End user application

Excel: KPI, SLA

Services: Storage, Job Scheduler

Log/Events

Perf

QNIBTerminal - Implementation

QNIBTerminal - blog.qnib.org

Containers (name: service):

haproxy: haproxy

dns: helixdns

elk: kibana, logstash, elasticsearch

etcd

carbon: carbon

graphite-web: graphite-web

graphite-api: graphite-api

grafana: grafana

slurmctld: slurmctld

compute0: slurmd

compute<N>: slurmd

Groups: Log/Events, Services, Performance, Compute
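
As a rough illustration of how a compute container could feed the Performance group: carbon accepts metrics over its plaintext protocol as "<path> <value> <timestamp>" lines, by default on TCP port 2003. The sketch below is not taken from the talk; the host name carbon, the metric path, and both helper names are assumptions.

```python
import socket
import time


def format_carbon_line(path, value, timestamp=None):
    """Build one plaintext-protocol line: path, value, timestamp, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="carbon", port=2003, timestamp=None):
    """Send a single metric to a carbon listener.

    The host name "carbon" is an assumption (it mirrors the container
    name on the slide); 2003 is carbon's default plaintext port.
    """
    line = format_carbon_line(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric path; this only prints the line, so it runs
    # without a reachable carbon host.
    print(format_carbon_line("cluster.compute0.load.one", 0.42, 1403690400), end="")
```

Keeping the formatting separate from the socket call makes the line format easy to check without a running Graphite stack.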

DEMONSTRATION

Future Work

More Services

‣ Improve work-flow for log-events

‣ Nagios(-like) node is missing

‣ Cluster-FileSystem

‣ LDAP

‣ Additional dashboards

‣ Inventory

‣ Using InfiniBand for communication traffic

Graph Representation

‣ Graph inventory needed

• Hierarchical view is not enough

• GraphDB seems to be a good idea

Nodes in the example graph: comp0, comp1, comp2, ibsw0, ibsw2, eth1, eth10, ldap12, lustre0

RETURN node=comp* WHERE ROUTE TO lustre_service INCLUDES ibsw2
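
The pseudo-query above could be approximated without a graph database, which may help to see what it asks for. The sketch below is only an illustration: the slide names the nodes but not the links between them, so the topology in LINKS is an assumption, lustre0 stands in for lustre_service, and "route" is interpreted as one shortest path found by breadth-first search.

```python
from collections import deque
from fnmatch import fnmatch

# Assumed topology: the links are invented for this example.
LINKS = [
    ("comp0", "ibsw0"),
    ("comp1", "ibsw0"),
    ("comp2", "ibsw2"),
    ("ibsw2", "ibsw0"),
    ("ibsw0", "lustre0"),
]


def build_graph(links):
    """Turn a list of undirected links into an adjacency dict."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_route(graph, start, goal):
    """Breadth-first search; returns one shortest route as a node list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


def nodes_routed_via(graph, pattern, target, via):
    """Nodes matching `pattern` whose shortest route to `target` passes `via`."""
    result = []
    for node in sorted(graph):
        if not fnmatch(node, pattern):
            continue
        route = shortest_route(graph, node, target)
        if route is not None and via in route:
            result.append(node)
    return result


if __name__ == "__main__":
    graph = build_graph(LINKS)
    print(nodes_routed_via(graph, "comp*", "lustre0", "ibsw2"))
```

A real InfiniBand fabric routes according to its subnet manager's forwarding tables rather than plain shortest paths, so a production inventory would have to store the actual routes.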

Conclusion

‣ n*1000 containers through clustering

‣ Complete toolchain could be automated

• Testing

• Verification

• Q&A

‣ Showcase

• Showing a customer his (to-be) software stack

• Convince the SysOps team ‘they have nothing to fear’

‣ Training

• New SysOps could start on a virtual cluster

• ‘Strangulate’ a node to replay an error

Conclusion

‣ n*100 of containers are easy (50 on my laptop)

• Running a 300 node cluster stack

Log AND Performance Management

‣ Metrics w/o logs are useless!

• and the other way around…

• overlapping is king
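
One way to picture the overlay of logs and metrics is to attach to each log event the metric samples recorded around it. The following is a minimal sketch under stated assumptions: the function name, the 60-second window, and all sample data are invented for illustration (the event text echoes the 'job XYZ starts' example from the abstract), and a real setup would pull events from the log store and samples from Graphite instead.

```python
from bisect import bisect_left, bisect_right


def overlay(events, samples, window=60):
    """Attach nearby metric samples to each event.

    events:  iterable of (timestamp, message)
    samples: iterable of (timestamp, value)
    Returns a list of (timestamp, message, [samples within +/- window s]).
    """
    samples = sorted(samples)
    times = [t for t, _ in samples]
    merged = []
    for ts, message in sorted(events):
        lo = bisect_left(times, ts - window)
        hi = bisect_right(times, ts + window)
        merged.append((ts, message, samples[lo:hi]))
    return merged


if __name__ == "__main__":
    # Invented data: timestamps in seconds, values as a load average.
    events = [(100, "job XYZ starts"), (400, "job XYZ ends")]
    samples = [(0, 0.1), (60, 0.2), (120, 3.8), (180, 4.0), (360, 3.9), (420, 0.3)]
    for ts, message, nearby in overlay(events, samples):
        print(ts, message, nearby)
```

Sorting the samples once and using binary search keeps the lookup cheap even when many events are overlaid on a long series.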
