28
Monitoring HTCondor A Marriage of Open Source and Custom William Deck

deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

MonitoringHTCondorAMarriageofOpenSourceandCustom

WilliamDeck

Page 2: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

Goals

• Helpthoselookingtoaugment/rollouttheirownclusterMonitoring• Possiblyadifferentperspectiveonmonitoring

•Getsomeofthisworkoutintheopen

Page 3: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

Who am I?• Originally a mainframe developer• Moved to SIG January of 2015• First task: Monitor the “cluster”

Page 4: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

OurCluster

• MixedCluster(Several)• ~1700cores

• 1100Windows• 600Linux(SLES11/SLES12)

• Storage• Lustre:2.8PB• GPFS:8.4PB

Page 5: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

OurWorkload

• ~100Kjobs/week• MixedOSWorkloads(MajorityWindows)• DailyWorkflows(1– 10K)• OneOffWorkflows(+100K)

Page 6: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

WhatistheProblem?• Manymovingparts• Multiplegroupsusingcondordifferently

• Singlesubmitfiles• DAGs• Someonefoundoutaboutcondor_run

• Monitor1– 100kjobs• Hardforuserstoself-support• Correlatejobfailurestoinfrastructureproblems

Page 7: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

MonitorEvolution

• HTCondor’sJobMonitor• CycleComputingSolution• CustomScriptsparsingcondor_*commands• PythonbindingswithElastic,Grafana,andConmon

Page 8: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

FirstSolution– TalesofCondor

• ModelledafterCycleComputingmonitoring• Parsingcondor_status,condor_historyoutput• Putinformationintocsvfiles• Simplewebpagedisplayed

• Totaljobsvsrunningjobs• Totaljobs/Uservsrunningjobs/User

Page 9: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

FirstSolution:Revolutionarycommandgrep

• Foreverythingelseweusedgrep• Painfulandtedious• Goodnews!

• Reallygoodatgrep

Page 10: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

Enter:Monmon

• Customwebpagedesignedaroundourworkflowcreationscripts• Parsesnodestatus/HTCondorlogfiles• Specificforonegroup’sworkflows• Usefulfortheusertodrilldownandsharewithothers

Page 11: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

EnterElastic,Grafana,andConmon

• Elastic(ELK)(https://www.elastic.co/)

– Elasticsearch- Search/analyticsengine

– Logstash- Ingestofdata

– Kibana- Frontend

• Grafana- (https://grafana.com/)

– extensionofKibana(3.0)

– Multiplebackends(graphite,ES,influxdb...)

Page 12: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

EnterElastic,Grafana,andConmon

• Pollperiodicallyusingpythonbindings• InsertintoES

• AugmentedJob/HostClassAd• CustomsubsetofClassAd

• UseGrafanafrontend• TalesofCondor2.0• MoreComplexDashboards• ExtendedMonmon->Conmon

Page 13: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

ESJobAdExample

● Searchable Job ClassAd● Common Uses

○ condor_history○ Used Resources

Page 14: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

Grafana:OneStopShop

Page 15: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

GrafanaDashboards

• EasilycreatedashboardsfromESandperformancemetrics• SinglePaneofGlassforCondorandInfrastructure

Page 16: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

TalesofCondor2.0

Page 17: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

OverallClusterHealth

Page 18: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing
Page 19: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

Grafana:ErrorRates

• Findblackholeexec• Postscriptvs.ExitCode• Errorsby:

• User• DAG• IsitUserorinfrastructure

Page 20: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

Conmon

• UsesFlask• ESQueries• Displayjobclassad• Workflowoverview• GridView• DAG/JobLogs• WorkflowAnalysis

Page 21: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

Conmon:Home(DAGs)

Page 22: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

Conmon:Workflow(DAGs)

Page 23: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

Conmon:Grid(DAGs)

Page 24: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing
Page 25: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

WorkflowAnalysis

• InformationfromES• Easytoseelonglegs

Page 26: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

IndividualJobs

Page 27: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

Conmon:Benefits• LoadingCondorinformationfromES• Canhandlemultiplesubmission/workflowstypes(DAGs,submit)• Usercanclickthroughjobs• Search/Filter• Shareable• Consistentviews

Page 28: deck sig monitor condor - University of Wisconsin–Madison · A Marriage of Open Source and Custom William Deck. Goals ... First Solution –Tales of Condor •Modelled after CycleComputing

Questions?