Cluster 2004, San Diego, CA. A Client-centric Grid Knowledgebase. George Kola, Tevfik Kosar and Miron Livny, University of Wisconsin-Madison. September 23rd, 2004


  • Cluster 2004, San Diego, CA

    A Client-centric Grid Knowledgebase

    George Kola, Tevfik Kosar and Miron Livny
    University of Wisconsin-Madison

    September 23rd, 2004

    Grid Trivia
      • How many of you have submitted a job to Grid resources and never heard back from it?
      • How many of you have been frustrated by the inconsistent behavior of some grid resources?
        • Completing some jobs successfully and failing others
        • Similar jobs performing completely differently

    ... We did!

    Goal: Prevent Unexpected Behavior in a Grid
      • Learn from experience and prevent failures from repeating in the future.
      • Causes of unexpected behavior in a Grid:
        • Black holes
        • Resources with:
          • Faulty hardware
          • Buggy or misconfigured software
        • Extremely slow computational sites
        • Memory leaks, etc.

    Black holes

    Black holes
      • Definition: A black hole is a region of spacetime from which nothing, not even light, can escape.
      • If you send a light beam into a black hole, you never hear back from it.
      • You can only recognize one after you have encountered it. Is it too late?
      • No. You should learn from experience.

    Black holes in the Grid
      • Resources that accept jobs but never complete them
      • You send a job to a resource, but never hear back from it.

    Black hole examples from real life
    In the WCER educational video processing pipeline:
      • A specific pool accepted and processed our jobs for a couple of hours, but evicted them before completion.
      • A machine accepted a job, but because of a memory leak it kept throwing shadow exceptions and retrying the job forever.
      • Some third-party transfers (GridFTP, DiskRouter) hung occasionally and never returned.
      • A machine caused errors because of a corrupted FPU: it completed MPEG-1 encoding successfully but failed MPEG-4 encoding.

    Grid is good, but not perfect
      • Heterogeneous resources
      • Multiple administrative domains
      • Spans wide area networks
      • Consists of commodity hardware and software
      • Prone to network, hardware, software, and middleware failures!

    We cannot expect everything from the Grid or Grid middleware!

    Take the Ethernet Approach
      • A truly distributed (and very effective) access control protocol to a shared service
        • Client responsible for access control
        • Client responsible for error detection
        • Client responsible for fairness
      • Keep track of job/resource performance and failure characteristics as observed by the client.
      • Use job/user log files collected at the client side to build a grid knowledgebase.

    Grid Knowledgebase
      • Parse user/job log files
      • Load them into a database
      • Aggregate the experience of different jobs
      • Interpret it
      • Plan action
      • Generate feedback to the scheduler as well as to the user
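A minimal sketch of the first three steps (parse, load, aggregate). It assumes a hypothetical one-event-per-line log format, `<unix_time> <job_id> <event> <machine>`, and an illustrative `events` table; neither is taken from the talk, and real job/user log files will look different.

```python
import sqlite3

# Hypothetical client-side log lines: "<unix_time> <job_id> <event> <machine>".
# This format is an assumption for illustration, not a real scheduler's log format.
SAMPLE_LOG = """\
100 job1 execute nodeA
400 job1 terminated_normally nodeA
120 job2 execute nodeB
900 job2 evicted nodeB
950 job2 execute nodeA
1300 job2 terminated_normally nodeA
"""

def parse_log(text):
    """Yield (time, job_id, event, machine) tuples from the sample format."""
    for line in text.splitlines():
        if not line.strip():
            continue
        ts, job_id, event, machine = line.split()
        yield int(ts), job_id, event, machine

def load_events(conn, events):
    """Load parsed events into a single 'events' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(ts INTEGER, job_id TEXT, event TEXT, machine TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", events)
    conn.commit()

def evictions_per_machine(conn):
    """Aggregate step: count the evictions observed on each machine."""
    rows = conn.execute(
        "SELECT machine, COUNT(*) FROM events "
        "WHERE event = 'evicted' GROUP BY machine ORDER BY machine"
    )
    return dict(rows.fetchall())

conn = sqlite3.connect(":memory:")
load_events(conn, parse_log(SAMPLE_LOG))
print(evictions_per_machine(conn))
```

An in-memory SQLite database keeps the sketch self-contained; any relational store would serve the same role.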

    [Figure: grid resources (personal computers, storage servers, clusters) and job logs]

    [Figure: grid resources (personal computers, storage servers, clusters), job logs, and the grid knowledgebase]

    Database Schema
    [Figure: job state diagram with the labels User, Submit, Schedule, Execute, Suspend, Un-suspend, Evicted, Terminated Normally, Terminated Abnormally, Job Succeeded, Job Failed, and Yes/No branches]

    Difference from existing approaches
      • Client view
      • Use only job/user log files at the client side
      • Many administrators do not want to share resource/scheduler log files.
      • We do not need to know everything going on in the whole grid
      • Scalable

    What do we get?
      • Collecting job execution time statistics
        • Average job execution time
        • Standard deviation
        • Fit a distribution
      • Detect and avoid black holes
      • For a normal distribution:
        • 99.7% of job execution times should lie between (avg - 3*stdev) and (avg + 3*stdev)
        • About 95% of job execution times should lie between (avg - 2*stdev) and (avg + 2*stdev)
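The rule above can be sketched as a small check. This is an illustration under stated assumptions, not the authors' implementation: execution times are assumed to be roughly normal, the job names and sample times are made up, and clamping the lower bound at zero is our choice because a run time cannot be negative.

```python
import statistics

def execution_time_limits(times, k=3):
    """Return (low, high) = avg -/+ k*stdev, with the lower bound clamped at 0."""
    avg = statistics.mean(times)
    stdev = statistics.stdev(times)  # sample standard deviation
    return max(0.0, avg - k * stdev), avg + k * stdev

def suspected_black_holes(running, limits):
    """Return ids of jobs whose elapsed time already exceeds the upper limit."""
    _, high = limits
    return sorted(job for job, elapsed in running.items() if elapsed > high)

# Hypothetical execution times of completed jobs, in minutes.
history = [6.0, 7.0, 8.0, 9.0, 10.0]
limits = execution_time_limits(history)

# Hypothetical jobs still running, with minutes elapsed so far.
running = {"job17": 9.5, "job18": 45.0}
print(limits)
print(suspected_black_holes(running, limits))
```

A job past the upper limit is only a suspect; whether to kill and resubmit it elsewhere is a policy decision left to the client.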

    Detecting hanging transfers

    Setting Execution Time Limits
      • Avg = 7.8 min
      • Stdev = 3.17 min
      • For a normal distribution:
        • 99.7%: [0, 17.31 min]
        • About 95%: [1.46 min, 14.14 min]
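As a worked check, both intervals follow from the average and standard deviation on this slide. The three-standard-deviation lower bound comes out negative; reading the slide's lower limit of 0 as a clamp, because an execution time cannot be negative, is our interpretation.

```latex
\begin{align*}
\text{avg} \pm 3\,\text{stdev} &= 7.8 \pm 3(3.17) = 7.8 \pm 9.51
  \;\Rightarrow\; [-1.71,\ 17.31] \;\to\; [0,\ 17.31]\ \text{min} \\
\text{avg} \pm 2\,\text{stdev} &= 7.8 \pm 2(3.17) = 7.8 \pm 6.34
  \;\Rightarrow\; [1.46,\ 14.14]\ \text{min}
\end{align*}
```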

    What do we get? (2)
      • Identifying misconfigured machines
        • e.g. find the set of machines that fail jobs with I/O data size larger than 2 GB (i.e. OS limitations)
      • Identifying factors affecting job run-time
      • Bug hunting
        • Job failures on certain inputs
        • Memory leaks
          • Scheduler logs image size regularly
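The 2 GB example can be sketched as a filter over per-job records. The record layout `(machine, io_bytes, succeeded)`, the machine names, and the choice to flag only machines that also completed a smaller job are assumptions made for illustration.

```python
from collections import defaultdict

TWO_GB = 2 * 1024**3  # assuming the limit means 2 GiB

def machines_failing_large_io(records, threshold=TWO_GB):
    """Return machines that failed every job above the threshold
    but completed at least one job at or below it."""
    large = defaultdict(list)  # machine -> outcomes of jobs above the threshold
    small_ok = set()           # machines with a successful smaller job
    for machine, io_bytes, succeeded in records:
        if io_bytes > threshold:
            large[machine].append(succeeded)
        elif succeeded:
            small_ok.add(machine)
    return sorted(
        m for m, outcomes in large.items()
        if not any(outcomes) and m in small_ok
    )

# Hypothetical (machine, io_bytes, succeeded) records as seen by the client.
records = [
    ("nodeA", 1 * 1024**3, True),
    ("nodeA", 3 * 1024**3, False),
    ("nodeA", 4 * 1024**3, False),
    ("nodeB", 3 * 1024**3, True),
    ("nodeC", 1 * 1024**3, False),
]
print(machines_failing_large_io(records))
```

Requiring a successful smaller job separates a size-specific problem from a machine that fails everything.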

    Catching Memory Leaks
    [Figure: job memory image size (MB) plotted against time]
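One way to turn periodically logged image sizes into a leak signal is to fit a line and look at its slope. This is a sketch of one possible heuristic, not the method shown in the talk; the sample data and the 1 MB/min threshold are hypothetical.

```python
def growth_rate(samples):
    """Least-squares slope of image size (MB) over time (minutes)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_s = sum(s for _, s in samples) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var

def looks_like_leak(samples, min_slope=1.0):
    """Flag steady growth: the size never shrinks between samples and
    the fitted slope exceeds min_slope MB/min."""
    never_shrinks = all(b[1] >= a[1] for a, b in zip(samples, samples[1:]))
    return never_shrinks and growth_rate(samples) > min_slope

# Hypothetical (minutes, image_size_mb) samples for two jobs.
leaky = [(0, 100), (10, 150), (20, 200), (30, 250)]
steady = [(0, 100), (10, 102), (20, 101), (30, 102)]
print(growth_rate(leaky), looks_like_leak(leaky), looks_like_leak(steady))
```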

    What do we get? (3)
      • Application optimization
        • How long does each step of an application/pipeline take to execute?
      • Adaptation
        • Find resources that take the least time to execute jobs from a particular class
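The adaptation step can be sketched as a per-class ranking of resources by mean run time. The record layout `(job_class, resource, minutes)` and the class and pool names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def fastest_resource_per_class(runs):
    """Map each job class to the resource with the lowest mean run time."""
    times = defaultdict(lambda: defaultdict(list))
    for job_class, resource, minutes in runs:
        times[job_class][resource].append(minutes)
    return {
        job_class: min(by_res, key=lambda r: mean(by_res[r]))
        for job_class, by_res in times.items()
    }

# Hypothetical (job_class, resource, minutes) records from completed jobs.
runs = [
    ("mpeg1", "poolA", 8.0), ("mpeg1", "poolA", 10.0), ("mpeg1", "poolB", 6.0),
    ("mpeg4", "poolA", 20.0), ("mpeg4", "poolB", 30.0), ("mpeg4", "poolB", 26.0),
]
print(fastest_resource_per_class(runs))
```

The mean is the simplest choice here; with few samples per resource, a median or a minimum sample count may be more robust.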

    Conclusions
      • View of the Grid from the client side
      • Job/user log files as the main source of information
      • Aggregate the experience of different jobs and pass it on to future ones
      • Helps in:
        • Catching black holes
        • Identifying faulty/misconfigured resources
        • Bug tracking
        • Statistics collection
      • Future work: Merge the experience of different clients

    Thank you

    For more information, contact:

    Tevfik Kosar
    http://www.cs.wisc.edu/[email protected]