View
3
Download
0
Category
Preview:
Citation preview
MonitoringHTCondorAMarriageofOpenSourceandCustom
WilliamDeck
Goals
• Helpthoselookingtoaugment/rollouttheirownclusterMonitoring• Possiblyadifferentperspectiveonmonitoring
•Getsomeofthisworkoutintheopen
Who am I?• Originally a mainframe developer• Moved to SIG January of 2015• First task: Monitor the “cluster”
OurCluster
• MixedCluster(Several)• ~1700cores
• 1100Windows• 600Linux(SLES11/SLES12)
• Storage• Lustre:2.8PB• GPFS:8.4PB
OurWorkload
• ~100Kjobs/week• MixedOSWorkloads(MajorityWindows)• DailyWorkflows(1– 10K)• OneOffWorkflows(+100K)
WhatistheProblem?• Manymovingparts• Multiplegroupsusingcondordifferently
• Singlesubmitfiles• DAGs• Someonefoundoutaboutcondor_run
• Monitor1– 100kjobs• Hardforuserstoself-support• Correlatejobfailurestoinfrastructureproblems
MonitorEvolution
• HTCondor’sJobMonitor• CycleComputingSolution• CustomScriptsparsingcondor_*commands• PythonbindingswithElastic,Grafana,andConmon
FirstSolution– TalesofCondor
• ModelledafterCycleComputingmonitoring• Parsingcondor_status,condor_historyoutput• Putinformationintocsvfiles• Simplewebpagedisplayed
• Totaljobsvsrunningjobs• Totaljobs/Uservsrunningjobs/User
FirstSolution:Revolutionarycommandgrep
• Foreverythingelseweusedgrep• Painfulandtedious• Goodnews!
• Reallygoodatgrep
Enter:Monmon
• Customwebpagedesignedaroundourworkflowcreationscripts• Parsesnodestatus/HTCondorlogfiles• Specificforonegroup’sworkflows• Usefulfortheusertodrilldownandsharewithothers
EnterElastic,Grafana,andConmon
• Elastic(ELK)(https://www.elastic.co/)
– Elasticsearch- Search/analyticsengine
– Logstash- Ingestofdata
– Kibana- Frontend
• Grafana- (https://grafana.com/)
– extensionofKibana(3.0)
– Multiplebackends(graphite,ES,influxdb...)
EnterElastic,Grafana,andConmon
• Pollperiodicallyusingpythonbindings• InsertintoES
• AugmentedJob/HostClassAd• CustomsubsetofClassAd
• UseGrafanafrontend• TalesofCondor2.0• MoreComplexDashboards• ExtendedMonmon->Conmon
ESJobAdExample
● Searchable Job ClassAd● Common Uses
○ condor_history○ Used Resources
Grafana:OneStopShop
GrafanaDashboards
• EasilycreatedashboardsfromESandperformancemetrics• SinglePaneofGlassforCondorandInfrastructure
TalesofCondor2.0
OverallClusterHealth
Grafana:ErrorRates
• Findblackholeexec• Postscriptvs.ExitCode• Errorsby:
• User• DAG• IsitUserorinfrastructure
Conmon
• UsesFlask• ESQueries• Displayjobclassad• Workflowoverview• GridView• DAG/JobLogs• WorkflowAnalysis
Conmon:Home(DAGs)
Conmon:Workflow(DAGs)
Conmon:Grid(DAGs)
WorkflowAnalysis
• InformationfromES• Easytoseelonglegs
IndividualJobs
Conmon:Benefits• LoadingCondorinformationfromES• Canhandlemultiplesubmission/workflowstypes(DAGs,submit)• Usercanclickthroughjobs• Search/Filter• Shareable• Consistentviews
Questions?
Recommended