PROOF: the Parallel ROOT Facility
Scheduling and Load-balancing
ACAT 2007
Jan Iwaszkiewicz ¹ ², Gerardo Ganis ¹, Fons Rademakers ¹
¹ CERN PH/SFT, ² University of Warsaw
ACAT 23 - 27th of April 2007
Jan Iwaszkiewicz, CERN PH/SFT
Outline
• Introduction to the Parallel ROOT Facility
• Packetizer – load balancing
• Resource scheduling
Analysis of the Large Hadron Collider data
• Necessity of distributed analysis
• ROOT – popular particle physics data analysis framework
• PROOF (ROOT's extension) – automatically parallelizes processing on computing clusters or multicore machines
Who is using PROOF
• PHOBOS
– MIT, dedicated cluster, interfaced with Condor
– Real data analysis, in production
• ALICE
– CERN Analysis Facility (CAF)
• CMS
– Santander group, dedicated cluster
– Physics TDR analysis
Very positive experience
• functionality, large speedup, efficient
But not really the LHC scenario
• Usage limited to a few experienced users
Using PROOF: example
• PROOF is designed for analysis of independent objects, e.g. ROOT Trees (basic data format in particle physics)
• Example of processing a set of ROOT trees:

Local ROOT:
    // Create a chain of trees
    root[0] TChain *c = CreateMyChain();
    // MySelec is a TSelector
    root[1] c->Process("MySelec.C+");

PROOF:
    // Create a chain of trees
    root[0] TChain *c = CreateMyChain();
    // Start PROOF and tell the chain to use it
    root[1] TProof::Open("masterURL");
    root[2] c->SetProof();
    // Process goes via PROOF
    root[3] c->Process("MySelec.C+");
Classic batch processing
[Diagram labels: storage, batch farm, queues, manager, catalog, query, submit, files, jobs, data file splitting, myAna.C, merging, final analysis, outputs]
• static use of resources
• jobs frozen: 1 job / worker node
• external splitting, merging
PROOF processing
[Diagram labels: catalog, storage, PROOF farm, scheduler, query, MASTER, PROOF job: data file list, myAna.C, files, final outputs (merged), feedbacks (merged)]
• farm perceived as extension of local PC
• same syntax as in local session
• more dynamic use of resources
• real time feedback
• automated splitting and merging
Challenges for PROOF
• Remain efficient under heavy load
• 100% exploitation of resources
• Reliability
Levels of scheduling
• The packetizer
– Load balancing on the level of a job
• Resource scheduling (assigning resources to different jobs)
– Introducing a central scheduler
– Priority based scheduling on worker nodes
Packetizer's role
• Lookup – check locations of all files and initiate staging, if needed
• Workers contact packetizer and ask for new packets (pull architecture)
• A Packet has info on
– which file to open
– which part of file to process
• Packetizer keeps assigning packets until the dataset is processed
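The pull model above can be sketched in plain C++. This is a minimal illustration and not the ROOT implementation: `Packet`, `FileInfo`, and `SimplePacketizer` are hypothetical names, packets are fixed-size, and the lookup/staging step is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One unit of work: which file to open and which part of it to process.
struct Packet {
    std::string file;
    long long firstEntry;
    long long numEntries;
};

// Hypothetical description of one input file (not a ROOT class).
struct FileInfo {
    std::string name;
    long long entries;
};

// Minimal pull-style packetizer with fixed-size packets (assumption).
class SimplePacketizer {
public:
    SimplePacketizer(std::vector<FileInfo> files, long long packetSize)
        : files_(std::move(files)),
          packetSize_(packetSize < 1 ? 1 : packetSize) {}

    // Called whenever a worker asks for more work.
    // Returns false once the whole dataset has been handed out.
    bool NextPacket(Packet& out) {
        // Skip files that are empty or already fully assigned.
        while (current_ < files_.size() &&
               offset_ >= files_[current_].entries) {
            ++current_;
            offset_ = 0;
        }
        if (current_ == files_.size()) return false;

        const FileInfo& f = files_[current_];
        const long long n = std::min(packetSize_, f.entries - offset_);
        out = Packet{f.name, offset_, n};
        offset_ += n;
        return true;
    }

private:
    std::vector<FileInfo> files_;
    long long packetSize_;
    std::size_t current_ = 0;  // index of the file being split
    long long offset_ = 0;     // first unassigned entry in that file
};
```

A worker loop would call `NextPacket` after finishing each packet, so faster workers simply come back more often and receive more of the dataset.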
PROOF dynamic load balancing
• Pull architecture guarantees scalability
• Adapts to variations in performance
[Diagram: Master and Worker 1 ... Worker N along a time axis; packet: unit of work distribution]
TPacketizer: the original packetizer
• Strategy
– Each worker processes its local files and then processes remaining remote files
– Fixed size packets
– Avoid overloading data server by allowing max 4 remote files to be served
• Problems with the TPacketizer
– Long tails with some I/O bound jobs
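The local-first strategy with a cap on remote readers can be sketched as a file-selection function. This is an illustration under assumptions, not TPacketizer's code: `FileSlot` and `PickFile` are hypothetical names, and the limit of 4 is taken to apply per data server.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical bookkeeping for one input file (not a ROOT class).
struct FileSlot {
    std::string server;  // host that stores the file
    bool assigned;       // true once some worker has taken it
};

// Pick the next file for a worker: local files first, then remote files
// from servers that currently serve fewer than maxRemote remote readers.
// Returns the index into `files`, or -1 if nothing can be assigned now.
int PickFile(const std::string& workerHost,
             const std::vector<FileSlot>& files,
             const std::map<std::string, int>& remoteReaders,
             int maxRemote = 4)
{
    // First pass: an unassigned file stored on the worker's own host.
    for (std::size_t i = 0; i < files.size(); ++i)
        if (!files[i].assigned && files[i].server == workerHost)
            return static_cast<int>(i);

    // Second pass: any remaining file is remote; respect the per-server cap.
    for (std::size_t i = 0; i < files.size(); ++i) {
        if (files[i].assigned) continue;
        auto it = remoteReaders.find(files[i].server);
        const int readers = (it == remoteReaders.end()) ? 0 : it->second;
        if (readers < maxRemote) return static_cast<int>(i);
    }
    return -1;  // all remaining servers are saturated, or nothing is left
}
```

Because remote files are only touched after a worker's local files are exhausted, a slow data server can end up serving the last files alone, which is one way the long tails mentioned above can arise.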
Performance tests with ALICE
• 35 PCs, dual Xeon 2.8 GHz, ~200 GB disk
– Standard CERN hardware for LHC
• Machine pools managed by xrootd
– Data of Physics Data Challenge '06 distributed (~ 1 M events)
• Tests performed
– Speedup (scalability) tests
– System response when running a combination of job types for increasing # of concurrent users
Example of problems with some I/O bound jobs
[Plot: processing rate during a query]
[Plot: resource utilization]
How to improve
• Focus on I/O based jobs
– Limited by hard drive or network bandwidth
• Predict which data servers can become bottlenecks
• Make sure that other workers help analyzing data from those servers
• Use time-based packet sizes
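A time-based packet size can be derived from a worker's measured processing rate. The sketch below is one possible formulation, assumed for illustration: `TimeBasedPacketSize`, the target duration, and the minimum size are hypothetical parameters, not values from PROOF.

```cpp
// Size a packet so that it should take about targetSeconds on a worker
// that has been processing entriesPerSecond, bounded by what is left.
long long TimeBasedPacketSize(double entriesPerSecond,
                              double targetSeconds,
                              long long remaining,
                              long long minEntries = 1)
{
    if (remaining <= 0) return 0;  // nothing left to hand out

    const double want = entriesPerSecond * targetSeconds;
    long long n;
    if (!(want >= static_cast<double>(minEntries))) {
        n = minEntries;            // no rate measured yet, or a very slow worker
    } else if (want >= static_cast<double>(remaining)) {
        n = remaining;             // the request covers the rest of the data
    } else {
        n = static_cast<long long>(want);
    }
    return n > remaining ? remaining : n;
}
```

With this rule a fast worker gets proportionally larger packets than a slow one, so all workers return to the packetizer at roughly the same interval.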
TAdaptivePacketizer
• Strategy
– Predicting the processing time of local files for each worker
– For the workers that are expected to finish faster, keep assigning remote files from the beginning of the job.
– Assign remote files from the most heavily loaded file servers
– Variable packet size
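The prediction step can be illustrated with a simple estimate: the time a server's local worker would need is the unprocessed local data divided by its observed rate, and remote help goes to the server with the largest estimate. This is an assumed simplification for illustration; `ServerLoad`, `PredictedSeconds`, and `MostLoadedServer` are hypothetical names, not part of TAdaptivePacketizer.

```cpp
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical per-server bookkeeping (not a ROOT class).
struct ServerLoad {
    std::string host;
    long long unprocessedEntries;  // local data not yet assigned
    double entriesPerSecond;       // observed rate of the local worker
};

// Estimated time for the local worker to finish its own files.
double PredictedSeconds(const ServerLoad& s)
{
    if (s.unprocessedEntries <= 0) return 0.0;
    if (s.entriesPerSecond <= 0.0)
        return std::numeric_limits<double>::infinity();  // no progress yet
    return static_cast<double>(s.unprocessedEntries) / s.entriesPerSecond;
}

// Index of the server expected to finish last, i.e. the best source of
// remote files for a worker that is ahead. Returns -1 if no data is left.
int MostLoadedServer(const std::vector<ServerLoad>& servers)
{
    int best = -1;
    double bestTime = 0.0;
    for (std::size_t i = 0; i < servers.size(); ++i) {
        const double t = PredictedSeconds(servers[i]);
        if (t > bestTime) {
            bestTime = t;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Taking remote files from the likely bottleneck from the start of the job, rather than at the end, is what shortens the tail.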
Improvement by up to 30%
[Plots: TPacketizer vs. TAdaptivePacketizer]
Scaling comparison for randomly distributed data set
Resource scheduling
• Motivation
• Central scheduler
– Model
– Interface
• Priority based scheduling on worker nodes
Why scheduling?
• Controlling resources and how they are used
• Improving efficiency
– assigning to a job those nodes that have data which needs to be analyzed.
• Implementing different scheduling policies
– e.g. fair share, group priorities & quotas
• Efficient use even in case of congestion
PROOF specific requirements
• Interactive system
– Jobs should be processed as soon as submitted.
– However, when max. system throughput is reached some jobs have to be postponed
• I/O bound jobs use more resources at the start and less at the end (file distribution)
• Try to process data locally
• User defines a dataset, not the # of workers
• Possibility to remove/add workers during a job
Starting a query with a central scheduler (planned)
[Diagram labels: Client, Master, Dataset Lookup, External Scheduler, job, packetizer, Start workers, Cluster status, User priority, history]
Plans
• Interface for scheduling "per job"
– Special functionality will allow changing the set of nodes during a session without losing user libraries and other settings
• Removing workers during a job
• Integration with a scheduler
– Maui, LSF?
Priority based scheduling on nodes
• Priority-based worker level load balancing
– Simple and solid implementation, no central unit
– Group priorities defined in the configuration file
• Performed on each worker node independently
• Lower priority processes slow down
– sleep before next packet request
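One way to realise "sleep before next packet request" is to make the sleep proportional to the time spent on the last packet, scaled by how far the group's priority is below the highest one. The formula and the name `SleepBeforeNextRequest` are assumptions chosen for illustration; the slides do not state the rule PROOF uses.

```cpp
// Seconds a worker process should sleep before asking for its next packet.
// A process at the highest priority never sleeps; one at half that priority
// sleeps as long as it worked, so it uses about half of the CPU time.
double SleepBeforeNextRequest(double lastPacketSeconds,
                              double priority,
                              double maxPriority)
{
    // Invalid or missing inputs: do not throttle.
    if (lastPacketSeconds <= 0.0 || priority <= 0.0 || maxPriority <= 0.0)
        return 0.0;
    if (priority >= maxPriority) return 0.0;
    return lastPacketSeconds * (maxPriority / priority - 1.0);
}
```

Each worker node can evaluate this locally from the configured group priorities, which matches the "no central unit" design above.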
Summary
• The adaptive packetizer is working very well in the current environment. It will be further tuned after introducing the scheduler
• Advanced work on the PROOF interface to the scheduler.
• Priority-based scheduling on nodes is being tested
The PROOF Team
• Maarten Ballintijn
• Bertrand Bellenot
• Rene Brun
• Gerardo Ganis
• Jan Iwaszkiewicz
• Andreas Peters
• Fons Rademakers
http://root.cern.ch