PanDA Status Report
Kaushik De
Univ. of Texas at Arlington
ANSE Meeting, Nashville
May 13, 2014
Overview
We are nearing end of ANSE project
  ~6 months
  Review goals/scope of PanDA work in ANSE
  Assess progress so far
PanDA work started ~1 year ago
  Plans for completion of current work
  Plans for new work
    Discuss tomorrow
Synergy with other projects
  Artem is co-funded by DOE-ASCR BigPanDA project
  BigPanDA continues for ~9 months after ANSE ends
  What happens after 2015?
May 13, 2014Kaushik De 2
PanDA Goals
Explicit integration of networking with PanDA
  Never before attempted for any WMS
  PanDA has many implicit assumptions about networking
  Goal 1: Use network information directly in PanDA workflow
  Goal 2: Attempt direct control (provisioning) through PanDA
ANSE + DOE-ASCR
  Picked a few well-defined topics
  Set up infrastructure and interactions with other projects
  Develop and deploy software
  Evaluation metrics
Deliver new capabilities for LHC experiments
  This is not only R&D: use in production environment
PanDA Steps
Collect network information
Storage and access
Using network information
Using dynamic circuits
Sources of Network Information
DDM Sonar measurements
  Actual transfer rates for files between all sites (Tier 1 and Tier 2)
  This information is normally used for site white/blacklisting
  Measurements available for small, medium, and large files
perfSonar (PS) measurements
  perfSonar provides dedicated network monitoring data
  All WLCG sites are being instrumented with PS boxes
  US sites are already instrumented and monitored
Federated XRootD (FAX) measurements
  Read times of remote files are measured for pairs of sites
This is not an exclusive list, just a starting point
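One way to picture how the three sources could feed a single lookup is a simple preference-ordered fallback. This is only a sketch: the source names, the dictionary layout, the preference order and the sample values are all assumptions made for illustration, not the collectors or schema actually used by PanDA.

```python
# Hypothetical sketch: pick a link measurement for a site pair from the
# first source that has data. Source names and layout are assumptions.

def link_estimate(src, dst, sources, preference=("sonar", "perfsonar", "fax")):
    """Return (source_name, value) from the first preferred source with data.

    Values from different sources are not directly comparable (transfer
    rate, network monitoring data, read time), so the source name is
    returned alongside the value. Returns None if no source has the pair.
    """
    for name in preference:
        value = sources.get(name, {}).get((src, dst))
        if value is not None:
            return name, value
    return None


# Made-up sample data, keyed by (source site, destination site).
sources = {
    "sonar": {("SITE_A", "SITE_B"): 12.5},
    "perfsonar": {("SITE_A", "SITE_B"): 90.0, ("SITE_A", "SITE_C"): 40.0},
    "fax": {},
}
print(link_estimate("SITE_A", "SITE_B", sources))
print(link_estimate("SITE_A", "SITE_C", sources))
```

A real combination algorithm would need to normalize the different measurements before mixing them; the fallback above only avoids having no number at all for a pair.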
http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Sonar&highlight=false
DDM Sonar
perfSonar
FAX
Data Repositories
Three levels of data storage and access
Native data repositories
  Historical data stored from collectors
  SSB: site status board for sonar and perfSonar data
  FAX data is kept independently and uploaded
AGIS (ATLAS Grid Information System)
  Most recent / processed data only, updated periodically
  Mixture of push/pull, moving to JSON API (pushed only)
schedConfigDB
  Internal Oracle DB used by PanDA for fast access
  Uses standard ATLAS collector
Using Network Information
Pick a few use cases
  Important to PanDA users
  Enhance workload management through use of network
  Should provide clear metrics for success/failure
Case 1: Improve User Analysis workflow
Case 2: Improve Tier 1 to Tier 2 workflow
Improving User Analysis
In PanDA, user jobs go to data
  Typically, user jobs are IO intensive, hence constrain jobs to data
  Note: almost any user payload is allowed by PanDA
  User analysis jobs are routed automatically to T1/T2 sites
For popular data, bottlenecks develop
  If data is only at a few sites, user jobs have long wait times
  PD2P was implemented 3 years ago to solve this problem
  Additional copies are made asynchronously by PanDA
  Waiting jobs are automatically re-brokered to new sites
  But bottlenecks still take time to clear up
Can we do something else using network information?
  Why not use FAX?
  First we need to develop network metrics for efficient use of FAX
Faster User analysis through FAX
First use case for network integration with PanDA
PanDA brokerage will use concept of ‘nearby’ sites
  Calculate weight based on usual brokerage criteria (availability of CPU, release, pilot rate…)
  Add network transfer cost to brokerage weight
  Jobs will be sent to the site with best weight, not necessarily the site with local data
  If nearby site has less wait time, access the data through FAX
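The brokerage idea above can be sketched in a few lines. The slides do not give the actual formula, so everything here is an assumption for illustration: the `Site` fields, the `cost_scale` parameter, the sample numbers, and the choice to divide the base weight by `1 + cost` for sites that must read the data remotely.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    base_weight: float    # from the usual brokerage criteria; higher is better
    has_local_data: bool  # True if the input data is already at this site
    transfer_cost: float  # hypothetical cost of reading the data remotely


def network_aware_weight(site: Site, cost_scale: float = 1.0) -> float:
    """Penalize sites that must read input remotely (assumed formula)."""
    if site.has_local_data:
        return site.base_weight
    return site.base_weight / (1.0 + cost_scale * site.transfer_cost)


def choose_site(sites, cost_scale: float = 1.0) -> Site:
    """Send the job to the site with the best network-aware weight."""
    return max(sites, key=lambda s: network_aware_weight(s, cost_scale))


# Made-up example: a congested site holding the data and two remote sites.
sites = [
    Site("T1_congested", base_weight=0.2, has_local_data=True, transfer_cost=0.0),
    Site("T2_nearby", base_weight=1.0, has_local_data=False, transfer_cost=0.5),
    Site("T2_distant", base_weight=1.2, has_local_data=False, transfer_cost=4.0),
]
best = choose_site(sites)
print(best.name, "local read" if best.has_local_data else "remote read (e.g. via FAX)")
```

In this made-up example the nearby Tier 2 wins: the distant site has the highest base weight, but its transfer cost outweighs that advantage, and the site with local data is too congested to compete.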
First Tests
Tested in production for ~1 day in March 2014
  Useful for debugging and tuning direct access infrastructure
  We got first results on network aware brokerage
Job distribution
  4748 jobs from 20 user tasks which required data from congested U.S. Tier 1 site were automatically brokered to U.S. Tier 1/2 sites
[Figure: Number of Jobs per Task (per-task values not recoverable from the extracted text)]
Brokerage Results
[Figure: FAX/non-FAX Ratio. # of Local Jobs and # of Remote Jobs (log scale, 1 to 10000) vs. Task Number, tasks 553 to 681]
[Figure: Job Wait Times. Local Jobs Wait Time and Remote Jobs Wait Time (axis 0 to 700, units not given) vs. Task Number, tasks 553 to 681]
Conclusions for Case 1
Network data collection working well
  Additional algorithms to combine network data will be tried
  HC tests working well, but PS data not robust yet
PanDA brokerage worked well
  Achieved goal of reducing wait time
  Well balanced local vs remote access
  Will fine tune after more data on performance
Waiting for final implementation
  But we have no data on actual performance of successful jobs
  Need to test and validate sites for this mode of data access
  First tests in March had 100% failure rate (FAX deployment related)
  Second test 1 week ago also did not go well
  Expect third test soon
Managing Data Rates
Tests have shown direct access rates need to be managed
Parameters for WAN throttling implemented in PanDA
  Throttling at brokerage level is easy (e.g. ratio of FAX jobs to non-FAX jobs), but does not guarantee throttling during execution
  Throttling during dispatch is not scalable when a million jobs are dispatched daily (scale may be higher in the future)
  Throttling may also be done at pilot level
  PanDA has implemented a mixed approach to throttling, being tested now
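Brokerage-level throttling by a job ratio can be sketched as a single check made before a job is brokered for remote access. The function name, the `max_ratio` parameter and its default are assumptions for illustration; the slides do not describe the actual PanDA parameters.

```python
def allow_remote_job(n_remote: int, n_local: int, max_ratio: float = 0.5) -> bool:
    """Return True if one more remote (FAX) job keeps remote/local <= max_ratio.

    Assumption: no remote jobs are brokered until at least one local job
    exists, so the ratio is always defined.
    """
    if n_local == 0:
        return False
    return (n_remote + 1) <= max_ratio * n_local


# Made-up counts: 10 local jobs already brokered for a site.
print(allow_remote_job(n_remote=4, n_local=10))  # fifth remote job
print(allow_remote_job(n_remote=5, n_local=10))  # sixth remote job
```

As the slide notes, a check like this only limits what is brokered. It says nothing about how many remote reads are in flight at execution time, which is why dispatch-level and pilot-level throttling are also considered.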
Cloud Selection
Second use case for network integration with PanDA
Optimize choice of T1-T2 pairings (cloud selection)
  In ATLAS, production tasks are assigned to Tier 1’s
  Tier 2’s are attached to a Tier 1 cloud for data processing
  Any T2 may be attached to multiple T1’s
  Currently, operations team makes this assignment manually
  This could/should be automated using network information
  For example, each T2 could be assigned to a native cloud by operations team, and PanDA will assign to other clouds based on network performance metrics
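The example in the last bullet could look roughly like the following: keep the manually assigned native cloud, then add the other Tier 1 clouds with the best measured rates to this Tier 2. The function, the `max_extra` and `min_rate` thresholds, the cloud names and the rates are all hypothetical; the slides do not state the actual selection rule.

```python
def assign_clouds(native_cloud, rates, max_extra=2, min_rate=10.0):
    """Return the clouds for one T2: native cloud first, then the best others.

    rates maps a Tier 1 cloud name to a measured transfer rate to this T2
    (units assumed consistent across clouds). Clouds below min_rate are
    skipped, and at most max_extra additional clouds are attached.
    """
    candidates = [
        (rate, t1)
        for t1, rate in rates.items()
        if t1 != native_cloud and rate >= min_rate
    ]
    candidates.sort(reverse=True)  # highest rate first
    return [native_cloud] + [t1 for _, t1 in candidates[:max_extra]]


# Made-up rates from four Tier 1 clouds to a single Tier 2.
rates = {"US": 50.0, "DE": 30.0, "FR": 5.0, "UK": 40.0}
print(assign_clouds("US", rates))
```

With these made-up numbers the Tier 2 stays in its native cloud and is additionally attached to the two best-connected clouds, while the cloud below the minimum rate is left out.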
DDM Sonar Data
http://aipanda021.cern.ch/networking/t1tot2d_matrix/
Tier 1 View
More T1 Information
Tier 2 View
Improving Site Association
More T2 Information
Conclusion for Case 2
Working well in real time
Currently implementing archival information
  Keep data for last ‘n’ Tier 1 - Tier 2 associations
  Necessary to check robustness of approach
  Algorithm may use the historical information in the future
Expect to deploy this summer
  Hopefully ~1 month
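Keeping the last ‘n’ associations per Tier 2 maps naturally onto a bounded queue. The sketch below is an assumption about how such an archive might be held in memory, using the standard library `collections.deque` with `maxlen`; the class name, the stability check and the sample data are hypothetical and not taken from the PanDA implementation.

```python
from collections import defaultdict, deque


class AssociationHistory:
    """Keep the last n Tier 1 associations computed for each Tier 2."""

    def __init__(self, n: int):
        # deque(maxlen=n) silently drops the oldest entry once n are stored.
        self._history = defaultdict(lambda: deque(maxlen=n))

    def record(self, t2: str, t1_list) -> None:
        """Store one computed association (ordered list of Tier 1 clouds)."""
        self._history[t2].append(tuple(t1_list))

    def last(self, t2: str):
        """Return the stored associations for a Tier 2, oldest first."""
        return list(self._history[t2])

    def is_stable(self, t2: str) -> bool:
        """True if the history is full and every stored association is identical."""
        h = self._history[t2]
        return len(h) == h.maxlen and len(set(h)) == 1


# Made-up example with n = 3.
hist = AssociationHistory(n=3)
for assoc in (["US"], ["US", "UK"], ["US", "UK"], ["US", "UK"]):
    hist.record("T2_EXAMPLE", assoc)
print(hist.last("T2_EXAMPLE"), hist.is_stable("T2_EXAMPLE"))
```

A stability check of this kind is one possible way to use the archive for the robustness question raised above: an association that keeps changing between updates is a sign the underlying measurements are too noisy to act on.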
Summary
First 2 use cases for network integration with PanDA working well
Work will be completed this summer
Metrics showing usefulness of approach will be available in Fall
On track for timely final report to ANSE