PanDA Status Report. Kaushik De, Univ. of Texas at Arlington. ANSE Meeting, Nashville, May 13, 2014.


Page 1: PanDA  Status Report

PanDA Status Report

Kaushik De
Univ. of Texas at Arlington

ANSE Meeting, Nashville
May 13, 2014

Page 2: PanDA  Status Report

Overview

  • We are nearing the end of the ANSE project (~6 months left)
      • Review goals/scope of PanDA work in ANSE
      • Assess progress so far (PanDA work started ~1 year ago)
      • Plans for completion of current work
      • Plans for new work (discuss tomorrow)
  • Synergy with other projects
      • Artem is co-funded by DOE-ASCR BigPanDA project
      • BigPanDA continues for ~9 months after ANSE ends
      • What happens after 2015?


Page 3: PanDA  Status Report

PanDA Goals

  • Explicit integration of networking with PanDA
      • Never before attempted for any WMS
      • PanDA has many implicit assumptions about networking
      • Goal 1: Use network information directly in PanDA workflow
      • Goal 2: Attempt direct control (provisioning) through PanDA
  • ANSE + DOE-ASCR
      • Picked a few well-defined topics
      • Set up infrastructure and interactions with other projects
      • Develop and deploy software
      • Evaluation metrics
  • Deliver new capabilities for LHC experiments
      • This is not only R&D: use in production environment


Page 4: PanDA  Status Report

PanDA Steps

  • Collect network information
  • Storage and access
  • Using network information
  • Using dynamic circuits


Page 5: PanDA  Status Report

Sources of Network Information

  • DDM Sonar measurements
      • Actual transfer rates for files between all sites (Tier 1 and Tier 2)
      • This information is normally used for site white/blacklisting
      • Measurements available for small, medium, and large files
  • perfSonar (PS) measurements
      • perfSonar provides dedicated network monitoring data
      • All WLCG sites are being instrumented with PS boxes
      • US sites are already instrumented and monitored
  • Federated XRootD (FAX) measurements
      • Read times of remote files are measured for pairs of sites
  • This is not an exclusive list, just a starting point


http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Sonar&highlight=false

Page 6: PanDA  Status Report

DDM Sonar


Page 7: PanDA  Status Report

perfSonar


Page 8: PanDA  Status Report

FAX


Page 9: PanDA  Status Report


Page 10: PanDA  Status Report

Data Repositories

  • Three levels of data storage and access
  • Native data repositories
      • Historical data stored from collectors
      • SSB: site status board for sonar and perfSonar data
      • FAX data is kept independently and uploaded
  • AGIS (ATLAS Grid Information System)
      • Most recent / processed data only, updated periodically
      • Mixture of push/pull, moving to JSON API (pushed only)
  • schedConfigDB
      • Internal Oracle DB used by PanDA for fast access
      • Uses standard ATLAS collector
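The "most recent data only" level can be pictured as a small cache keyed by measurement source and site pair. The following is only an illustrative sketch: the class names, field names, site names, and the unit-less `value` are hypothetical and are not taken from AGIS or schedConfigDB.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Measurement:
    source: str       # hypothetical labels, e.g. "sonar", "perfsonar", "fax"
    src_site: str
    dst_site: str
    value: float      # measured rate; the unit is an assumption of this sketch
    timestamp: float  # seconds since epoch


class LatestMeasurements:
    """Keeps only the newest measurement per (source, src_site, dst_site)."""

    def __init__(self) -> None:
        self._latest: Dict[Tuple[str, str, str], Measurement] = {}

    def update(self, m: Measurement) -> None:
        key = (m.source, m.src_site, m.dst_site)
        current = self._latest.get(key)
        # Ignore measurements that arrive late and are older than what we hold.
        if current is None or m.timestamp >= current.timestamp:
            self._latest[key] = m

    def get(self, source: str, src_site: str, dst_site: str) -> Optional[Measurement]:
        return self._latest.get((source, src_site, dst_site))


cache = LatestMeasurements()
cache.update(Measurement("sonar", "T1_A", "T2_B", 10.0, timestamp=100.0))
cache.update(Measurement("sonar", "T1_A", "T2_B", 20.0, timestamp=200.0))
cache.update(Measurement("sonar", "T1_A", "T2_B", 15.0, timestamp=150.0))  # stale, dropped
latest = cache.get("sonar", "T1_A", "T2_B")
```

Historical values would stay in the native repositories; a cache like this only answers "what is the current number for this pair?".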


Page 11: PanDA  Status Report


Page 12: PanDA  Status Report

Using Network Information

  • Pick a few use cases
      • Important to PanDA users
      • Enhance workload management through use of network
      • Should provide clear metrics for success/failure
  • Case 1: Improve User Analysis workflow
  • Case 2: Improve Tier 1 to Tier 2 workflow


Page 13: PanDA  Status Report

Improving User Analysis

  • In PanDA, user jobs go to data
      • Typically, user jobs are IO intensive, hence jobs are constrained to data
      • Note: almost any user payload is allowed by PanDA
      • User analysis jobs are routed automatically to T1/T2 sites
  • For popular data, bottlenecks develop
      • If data is only at a few sites, user jobs have long wait times
      • PD2P was implemented 3 years ago to solve this problem
      • Additional copies are made asynchronously by PanDA
      • Waiting jobs are automatically re-brokered to new sites
      • But bottlenecks still take time to clear up
  • Can we do something else using network information?
      • Why not use FAX?
      • First we need to develop network metrics for efficient use of FAX


Page 14: PanDA  Status Report

Faster User analysis through FAX

  • First use case for network integration with PanDA
  • PanDA brokerage will use concept of ‘nearby’ sites
      • Calculate weight based on usual brokerage criteria (availability of CPU, release, pilot rate…)
      • Add network transfer cost to brokerage weight
      • Jobs will be sent to the site with best weight, not necessarily the site with local data
      • If nearby site has less wait time, access the data through FAX
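The idea of folding a network transfer cost into the brokerage weight can be sketched as follows. This is a minimal illustration, not PanDA's actual brokerage formula: the `Site` fields, the way the cost divides the weight, the `cost_scale` parameter, and the site names and numbers are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Site:
    name: str
    base_weight: float    # hypothetical score from the usual criteria (CPU, release, pilot rate)
    has_local_data: bool
    transfer_cost: float  # hypothetical network cost of reading the data remotely


def brokerage_weight(site: Site, cost_scale: float = 1.0) -> float:
    """Higher is better; sites without local data are penalised by network cost."""
    if site.has_local_data:
        return site.base_weight
    return site.base_weight / (1.0 + cost_scale * site.transfer_cost)


def choose_site(sites: Sequence[Site]) -> Site:
    """Pick the site with the best weight, not necessarily the one holding the data."""
    return max(sites, key=brokerage_weight)


sites = [
    Site("T1_CONGESTED", base_weight=1.0, has_local_data=True, transfer_cost=0.0),
    Site("T2_NEARBY", base_weight=8.0, has_local_data=False, transfer_cost=1.0),
    Site("T2_FAR", base_weight=8.0, has_local_data=False, transfer_cost=9.0),
]
best = choose_site(sites)
print(best.name)  # T2_NEARBY
```

Here the congested site with local data scores 1.0, the nearby site 4.0, and the distant site 0.8, so the job would go to the nearby site and read its input remotely.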


Page 15: PanDA  Status Report

First Tests

  • Tested in production for ~1 day in March 2014
      • Useful for debugging and tuning direct access infrastructure
      • We got first results on network-aware brokerage
  • Job distribution
      • 4748 jobs from 20 user tasks, which required data from a congested U.S. Tier 1 site, were automatically brokered to U.S. Tier 1/2 sites


[Figure: Number of Jobs per Task]

Page 16: PanDA  Status Report

Brokerage Results


[Figure: FAX/non-FAX Ratio. Number of local jobs and number of remote jobs (log scale, 1 to 10000) by task number, tasks 553 to 681]

[Figure: Job Wait Times. Local jobs wait time and remote jobs wait time (axis 0 to 700) by task number, tasks 553 to 681]

Page 17: PanDA  Status Report

Conclusions for Case 1

  • Network data collection working well
      • Additional algorithms to combine network data will be tried
      • HC tests working well, but PS data not robust yet
  • PanDA brokerage worked well
      • Achieved goal of reducing wait time
      • Well balanced local vs remote access
      • Will fine tune after more data on performance
  • Waiting for final implementation
      • But we have no data on actual performance of successful jobs
      • Need to test and validate sites for this mode of data access
      • First tests in March had 100% failure rate (FAX deployment related)
      • Second test 1 week ago also did not go well
      • Expect third test soon


Page 18: PanDA  Status Report

Managing Data Rates

  • Tests have shown direct access rates need to be managed
  • Parameters for WAN throttling implemented in PanDA
      • Throttling at brokerage level is easy (e.g. ratio of FAX jobs to non-FAX jobs), but does not guarantee throttling during execution
      • Throttling during dispatch is not scalable when a million jobs are dispatched daily (scale may be higher in the future)
      • Throttling may also be done at pilot level
      • PanDA has implemented a mixed approach to throttling, being tested now
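Brokerage-level throttling by FAX/non-FAX ratio can be illustrated with a simple counter. This is a sketch under stated assumptions, not PanDA's throttling code: the class name, the `max_fax_ratio` parameter, and the rule of falling back to local access are hypothetical.

```python
class FaxRatioThrottle:
    """Caps remote (FAX) assignments at max_fax_ratio times the local assignments.

    This only limits what brokerage hands out; as the slide notes, it does not
    guarantee any limit on WAN traffic while the jobs are actually running.
    """

    def __init__(self, max_fax_ratio: float) -> None:
        self.max_fax_ratio = max_fax_ratio
        self.fax_jobs = 0
        self.local_jobs = 0

    def assign(self, prefers_remote: bool) -> str:
        # Would one more FAX job keep fax_jobs / local_jobs within the cap?
        within_cap = self.fax_jobs + 1 <= self.max_fax_ratio * self.local_jobs
        if prefers_remote and within_cap:
            self.fax_jobs += 1
            return "fax"
        self.local_jobs += 1
        return "local"


throttle = FaxRatioThrottle(max_fax_ratio=0.5)
decisions = [throttle.assign(prefers_remote=True) for _ in range(10)]
print(throttle.fax_jobs, throttle.local_jobs)  # 3 7
```

Even when every job would prefer remote access, the counter keeps the FAX share at or below the configured ratio at assignment time.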


Page 19: PanDA  Status Report

Cloud Selection

  • Second use case for network integration with PanDA
  • Optimize choice of T1-T2 pairings (cloud selection)
      • In ATLAS, production tasks are assigned to Tier 1s
      • Tier 2s are attached to a Tier 1 cloud for data processing
      • Any T2 may be attached to multiple T1s
      • Currently, operations team makes this assignment manually
      • This could/should be automated using network information
      • For example, each T2 could be assigned to a native cloud by operations team, and PanDA will assign to other clouds based on network performance metrics
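The example in the last bullet (a native cloud fixed by operations, extra clouds chosen from network metrics) might look like the sketch below. The function name, the rate threshold, the `max_extra` limit, and the cloud names and rates are illustrative assumptions; the slides do not specify the selection rule.

```python
from typing import Dict, List


def select_clouds(
    native_cloud: str,
    rates: Dict[str, float],   # hypothetical measured T1->T2 rate per cloud (unit assumed)
    min_rate: float,
    max_extra: int = 2,
) -> List[str]:
    """Return the native cloud first, then the fastest other clouds above min_rate."""
    extras = sorted(
        (cloud for cloud, rate in rates.items()
         if cloud != native_cloud and rate >= min_rate),
        key=lambda cloud: rates[cloud],
        reverse=True,
    )
    return [native_cloud] + extras[:max_extra]


rates_to_t2 = {"US": 90.0, "DE": 40.0, "FR": 12.0, "UK": 55.0}
clouds = select_clouds("US", rates_to_t2, min_rate=30.0)
print(clouds)  # ['US', 'UK', 'DE']
```

The native cloud is always kept, so a bad measurement can at worst remove an additional association, not the manually assigned one.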


Page 20: PanDA  Status Report

DDM Sonar Data


http://aipanda021.cern.ch/networking/t1tot2d_matrix/

Page 21: PanDA  Status Report

Tier 1 View


Page 22: PanDA  Status Report

More T1 Information


Page 23: PanDA  Status Report

Tier 2 View


Page 24: PanDA  Status Report

Improving Site Association


Page 25: PanDA  Status Report

More T2 Information


Page 26: PanDA  Status Report

Conclusion for Case 2

  • Working well in real time
  • Currently implementing archival information
      • Keep data for last ‘n’ Tier 1 – Tier 2 associations
      • Necessary to check robustness of approach
      • Algorithm may use the historical information in the future
  • Expect to deploy this summer
      • Hopefully ~1 month
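Keeping the last ‘n’ associations for a Tier 2 and checking how stable they are could be sketched with a bounded queue. The class, the `stability` measure, and the cloud names are hypothetical; the slides do not say how robustness will be evaluated.

```python
from collections import deque
from typing import Deque, FrozenSet, Iterable


class AssociationHistory:
    """Remembers the last n sets of Tier 1 clouds associated with one Tier 2."""

    def __init__(self, n: int) -> None:
        # A deque with maxlen drops the oldest entry automatically.
        self._history: Deque[FrozenSet[str]] = deque(maxlen=n)

    def record(self, clouds: Iterable[str]) -> None:
        self._history.append(frozenset(clouds))

    def stability(self, cloud: str) -> float:
        """Fraction of remembered associations that included this cloud."""
        if not self._history:
            return 0.0
        return sum(cloud in entry for entry in self._history) / len(self._history)


history = AssociationHistory(n=3)
for clouds in (["US", "UK"], ["US", "DE"], ["US", "UK"], ["US", "UK"]):
    history.record(clouds)

us_stability = history.stability("US")  # present in all 3 remembered entries
de_stability = history.stability("DE")  # present in 1 of the 3 remembered entries
```

A cloud whose stability swings between runs would be a hint that the underlying network measurements are too noisy to drive the association on their own.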


Page 27: PanDA  Status Report

Summary

  • First 2 use cases for network integration with PanDA working well
  • Work will be completed this summer
  • Metrics showing usefulness of approach will be available in Fall
  • On track for timely final report to ANSE
