deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom...

Preview:

Citation preview

MonitoringHTCondorAMarriageofOpenSourceandCustom

WilliamDeck

Goals

• Helpthoselookingtoaugment/rollouttheirownclusterMonitoring• Possiblyadifferentperspectiveonmonitoring

•Getsomeofthisworkoutintheopen

Who am I?• Originally a mainframe developer• Moved to SIG January of 2015• First task: Monitor the “cluster”

OurCluster

• MixedCluster(Several)• ~1700cores

• 1100Windows• 600Linux(SLES11/SLES12)

• Storage• Lustre:2.8PB• GPFS:8.4PB

OurWorkload

• ~100Kjobs/week• MixedOSWorkloads(MajorityWindows)• DailyWorkflows(1– 10K)• OneOffWorkflows(+100K)

WhatistheProblem?• Manymovingparts• Multiplegroupsusingcondordifferently

• Singlesubmitfiles• DAGs• Someonefoundoutaboutcondor_run

• Monitor1– 100kjobs• Hardforuserstoself-support• Correlatejobfailurestoinfrastructureproblems

MonitorEvolution

• HTCondor’sJobMonitor• CycleComputingSolution• CustomScriptsparsingcondor_*commands• PythonbindingswithElastic,Grafana,andConmon

FirstSolution– TalesofCondor

• ModelledafterCycleComputingmonitoring• Parsingcondor_status,condor_historyoutput• Putinformationintocsvfiles• Simplewebpagedisplayed

• Totaljobsvsrunningjobs• Totaljobs/Uservsrunningjobs/User

FirstSolution:Revolutionarycommandgrep

• Foreverythingelseweusedgrep• Painfulandtedious• Goodnews!

• Reallygoodatgrep

Enter:Monmon

• Customwebpagedesignedaroundourworkflowcreationscripts• Parsesnodestatus/HTCondorlogfiles• Specificforonegroup’sworkflows• Usefulfortheusertodrilldownandsharewithothers

EnterElastic,Grafana,andConmon

• Elastic(ELK)(https://www.elastic.co/)

– Elasticsearch- Search/analyticsengine

– Logstash- Ingestofdata

– Kibana- Frontend

• Grafana- (https://grafana.com/)

– extensionofKibana(3.0)

– Multiplebackends(graphite,ES,influxdb...)

EnterElastic,Grafana,andConmon

• Pollperiodicallyusingpythonbindings• InsertintoES

• AugmentedJob/HostClassAd• CustomsubsetofClassAd

• UseGrafanafrontend• TalesofCondor2.0• MoreComplexDashboards• ExtendedMonmon->Conmon

ESJobAdExample

● Searchable Job ClassAd● Common Uses

○ condor_history○ Used Resources

Grafana:OneStopShop

GrafanaDashboards

• EasilycreatedashboardsfromESandperformancemetrics• SinglePaneofGlassforCondorandInfrastructure

TalesofCondor2.0

OverallClusterHealth

Grafana:ErrorRates

• Findblackholeexec• Postscriptvs.ExitCode• Errorsby:

• User• DAG• IsitUserorinfrastructure

Conmon

• UsesFlask• ESQueries• Displayjobclassad• Workflowoverview• GridView• DAG/JobLogs• WorkflowAnalysis

Conmon:Home(DAGs)

Conmon:Workflow(DAGs)

Conmon:Grid(DAGs)

WorkflowAnalysis

• InformationfromES• Easytoseelonglegs

IndividualJobs

Conmon:Benefits• LoadingCondorinformationfromES• Canhandlemultiplesubmission/workflowstypes(DAGs,submit)• Usercanclickthroughjobs• Search/Filter• Shareable• Consistentviews

Questions?

Recommended