
Lecture 4: Grid Data Management, June 21-25, 2004. Jaime Frey, UW-Madison Condor Group, [email protected]


  • Lecture 4: Grid Data Management
    Jaime Frey, UW-Madison Condor Group, [email protected]
    Slides prepared in part by Scott Koranda, UW-Milwaukee & [email protected]
    Grid Summer Workshop, June 21-25, 2004

    Lecture4: Grid Data Management

  • Motivation?
    • Why is the Grid community concerned with data/file management?
    • Why might you be concerned with data/file management?


  • Motivation: The Data Problem
    Motivate our discussion with the large physics experiments (part of GriPhyN and Grid2003)
    • Laser Interferometer Gravitational Wave Observatory
      • Detect spacetime ripples from black holes & other sources
      • Generates data at 10 MB per second, just under 1 TB per day
    • Sloan Digital Sky Survey
      • Catalog more stars and galaxies than ever before
      • More than 15 TB of data catalogs
    • Compact Muon Solenoid and ATLAS
      • Detect the Higgs boson (a fundamental particle)
      • 100 MB per second, about 1 petabyte per year (per detector)
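The quoted rates can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming decimal units (1 MB = 10^6 bytes) and, for the LHC detectors, roughly 10^7 seconds of data taking per year. That live-time figure is an assumption made here for illustration; it is not stated on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def volume_bytes(rate_mb_per_s, seconds):
    """Bytes accumulated at a constant rate given in decimal MB per second."""
    return rate_mb_per_s * 1_000_000 * seconds


# LIGO: 10 MB/s sustained for one full day.
ligo_per_day_tb = volume_bytes(10, SECONDS_PER_DAY) / 1e12

# CMS or ATLAS: 100 MB/s over an ASSUMED 1e7 seconds of running per year.
ASSUMED_LIVE_SECONDS_PER_YEAR = 10_000_000
lhc_per_year_pb = volume_bytes(100, ASSUMED_LIVE_SECONDS_PER_YEAR) / 1e15

print(f"LIGO: {ligo_per_day_tb:.3f} TB per day")
print(f"LHC detector: {lhc_per_year_pb:.1f} PB per year")
```

Under these assumptions the LIGO figure comes out at 0.864 TB per day, which matches "just under 1 TB per day".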


  • Really Two Data Problems
    • The amount of data
      • High-performance tools needed to manage the huge raw volume of data
        • Store it
        • Move it
      • Measure in terabytes, petabytes, and ???
    • The number of data files
      • High-performance tools needed to manage the huge number of filenames
      • 10^12 filenames are expected soon
      • A collection of 10^12 of anything is a lot to handle efficiently


  • Three Data Questions on the Grid
    Essentially three (3) questions that you want Grid tools to address:

    1. What data/files exist?
    2. What data/files are where?
    3. How do I move data/files from A to B?


  • Three Data Questions on the Grid
    Examine these questions last to first: even if you don't have TBs of data you will want to move files, so start with #3.

    1. What data/files exist?
    2. What data/files are where?
    3. How do I move data/files from A to B?


  • How to move data/files?
    Requirements
    • Fast: as fast as networks and protocols allow
      • Internet2 (I2) sites should expect at least 10 MB/s sustained
    • Secure
      • Server must only share files with strongly authenticated clients
      • No passwords in the clear or similar
    • Robust
      • Fault tolerant, time-tested protocol
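To get a feel for what "at least 10 MB/s sustained" means in practice, the sketch below estimates how long a bulk transfer takes at that floor. It assumes decimal units and a perfectly steady rate, and uses the 15 TB Sloan Digital Sky Survey catalog size quoted earlier in the lecture as the example payload. The function names are illustrative only.

```python
def transfer_time_seconds(size_bytes, rate_mb_per_s):
    """Seconds needed to move size_bytes at a steady rate in decimal MB/s."""
    return size_bytes / (rate_mb_per_s * 1_000_000)


def format_duration(seconds):
    """Render a duration as days, hours and minutes (seconds truncated)."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes = rem // 60
    return f"{days}d {hours}h {minutes}m"


sdss_catalog_bytes = 15 * 10**12  # "more than 15 TB of data catalogs"
print(format_duration(transfer_time_seconds(sdss_catalog_bytes, 10)))
```

At the 10 MB/s floor the 15 TB example takes more than two weeks, which is why the later slides spend so much effort on going faster.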


  • GridFTP
    • Extension to the well-known File Transfer Protocol (FTP)
      • http://www.globus.org/datagrid/deliverables/C2WPdraft3.pdf
    • Extensions include
      • Strong authentication, encryption via Globus GSI
      • Multiple, parallel data channels
      • Third-party transfers
      • Tunable network & I/O parameters
      • Server-side processing, command pipelining


  • Necessary Semantics
    • GridFTP is the protocol
    • A server or client that implements the GridFTP protocol is "GridFTP-enabled" or "Grid-enabled"
    • You will often hear "the GridFTP server" or "the GridFTP client"
      • Correct is "the GridFTP-enabled server from the Globus team" or "the particular client being used"
      • Let it slide: it is easier to use the slang, but...
      • The distinction will become more important soon, as groups outside of Globus release GridFTP-enabled clients & servers


  • GridFTP Server
    • Built on top of wuftpd, our old friend
      • A brand new server written from scratch is in beta now
    • Most configuration details are the same as for wuftpd
    • Runs as an inetd (xinetd) service
      • Connection is attempted on port 2811
      • Xinetd looks up the port in /etc/services and finds the responsible service
      • Xinetd starts the service according to its configuration, with data from the connection sent on stdin
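The port-to-service lookup step can be illustrated with a small parser for /etc/services-style text. This is only a sketch of the lookup idea, not how xinetd itself is implemented; the function name is hypothetical, and the sample text reuses the two entries this lecture lists for gsiftp and globus-gatekeeper.

```python
def parse_services(text):
    """Map (port, protocol) to service name from /etc/services-style text."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue  # not a "name port/proto" entry
        port, proto = fields[1].split("/", 1)
        table[(int(port), proto)] = fields[0]
    return table


SAMPLE = """\
gsiftp            2811/tcp   # Grid-FTP Server
globus-gatekeeper 2119/tcp   # Globus Gatekeeper
"""

services = parse_services(SAMPLE)
# A connection arriving on TCP port 2811 resolves to the gsiftp service.
print(services[(2811, "tcp")])
```

On a real host the same function could be fed the contents of /etc/services; Python's standard `socket.getservbyport(2811, "tcp")` performs the equivalent system lookup when the entry is present.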


  • GridFTP Server
    From /etc/services

        [services]$ tail /etc/services
        gsiftp            2811/tcp   #Grid-FTP Server
        globus-gatekeeper 2119/tcp   #Globus Gatekeeper

    From /etc/xinetd.d/

        [xinetd.d]$ cat gsiftp
        service gsiftp
        {
            socket_type    = stream
            protocol       = tcp
            env            = LD_LIBRARY_PATH=/opt/ldg-2.0/globus/lib
            wait           = no
            user           = root
            server         = /opt/ldg-2.0/globus/sbin/in.ftpd
            server_args    = -l -a -G /opt/ldg-2.0/globus
            log_on_success += DURATION USERID
            log_on_failure += USERID
            nice           = 10
            disable        = no
        }


  • GridFTP Server
    Environment variables
    • LD_LIBRARY_PATH
      • Point to $GLOBUS_LOCATION/lib
    • GRIDMAP
      • Path to grid-mapfile for authentication
      • Generic GSI environment variable
    • X509_CERT_DIR
      • Directory in which CA signing certificates are held
      • Generic GSI environment variable
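The grid-mapfile that GRIDMAP points to associates a certificate subject with a local account. The sketch below assumes the commonly used layout of one entry per line, a double-quoted subject followed by a comma-separated list of local user names, and assumes /etc/grid-security/grid-mapfile as the fallback location; check both against your own installation. The function names are hypothetical.

```python
import os
import shlex


def gridmap_path(environ=os.environ):
    """Location of the grid-mapfile: $GRIDMAP if set, else an assumed default."""
    return environ.get("GRIDMAP", "/etc/grid-security/grid-mapfile")


def parse_gridmap(text):
    """Map certificate subject to a list of local accounts."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = shlex.split(line)  # honours the quotes around the subject
        if len(parts) < 2:
            continue
        mapping[parts[0]] = parts[1].split(",")
    return mapping


SAMPLE = '"/DC=org/DC=doegrids/OU=People/CN=Scott Koranda 43845" skoranda\n'
print(parse_gridmap(SAMPLE))
```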


  • GridFTP Server
    • Logging to system log
      • On most Linux systems, /var/log/messages

        Jun 10 10:46:59 basil gridftpd[21857]: GSSAPI user /DC=org/DC=doegrids/OU=People/CN=Scott Koranda 43845 is authorized as skoranda
        Jun 10 10:46:59 basil gridftpd[21857]: FTP LOGIN FROM oregano.phys.uwm.edu [129.89.57.55], skoranda

    • Uses host certificate for mutual authentication

        [[email protected] root]# grid-cert-info -file /etc/grid-security/hostcert.pem -subject
        /DC=org/DC=doegrids/OU=Services/CN=basil.phys.uwm.edu
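Log lines like the ones above are easy to mine for who connected and which local account they were mapped to. A minimal sketch, assuming the message wording is exactly as in the sample; the regular expression and function name are illustrative and may need adjusting for other server versions.

```python
import re

# Matches the "GSSAPI user <subject> is authorized as <account>" message.
AUTH_RE = re.compile(
    r"gridftpd\[(?P<pid>\d+)\]: GSSAPI user (?P<dn>/.+) "
    r"is authorized as (?P<user>\S+)$"
)


def parse_auth_line(line):
    """Return pid, certificate subject and local user, or None if no match."""
    match = AUTH_RE.search(line.strip())
    if match is None:
        return None
    return {
        "pid": int(match["pid"]),
        "dn": match["dn"],
        "user": match["user"],
    }


LINE = (
    "Jun 10 10:46:59 basil gridftpd[21857]: GSSAPI user "
    "/DC=org/DC=doegrids/OU=People/CN=Scott Koranda 43845 "
    "is authorized as skoranda"
)
print(parse_auth_line(LINE))
```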


  • GridFTP Server
    • Third-party transfers
      • Client directs transfers between two servers

    [Diagram labels: "move file1 to ldas-cit.ligo.caltech.edu", "file1"]


  • GridFTP clients
    • Globus-url-copy
      • GridFTP-compliant client from the Globus team
      • Copies files from one URL to another URL
        • One URL is usually a gsiftp:// URL
        • The other URL is usually a file:/ URL
      • To move a file from a remote GridFTP-enabled server to the local machine:

        globus-url-copy gsiftp://dataserver.phys.uwm.edu/data/file1 file:/home/skoranda/file1


  • Globus-url-copy
    • Alternative forms for file:/ URLs

        globus-url-copy gsiftp://dataserver.phys.uwm.edu/data/file1 file://localhost/home/skoranda/file1
        globus-url-copy gsiftp://dataserver.phys.uwm.edu/data/file1 file://basil.phys.uwm.edu/home/skoranda/file1

    • What if the GridFTP server runs on a non-standard port?

        globus-url-copy gsiftp://dataserver.phys.uwm.edu:15000/data/file1 file:/home/skoranda/file1
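The URL forms above differ only in whether a host and port are spelled out. A minimal sketch using Python's standard `urllib.parse` shows how the pieces break down; it assumes port 2811 when a gsiftp:// URL names none (the port the server slides register for gsiftp) and treats a host-less file:/ URL as the local machine. The function name is hypothetical.

```python
from urllib.parse import urlsplit

GRIDFTP_DEFAULT_PORT = 2811


def describe_url(url):
    """Split a gsiftp:// or file:/ URL into (scheme, host, port, path)."""
    parts = urlsplit(url)
    if parts.scheme == "gsiftp":
        port = parts.port or GRIDFTP_DEFAULT_PORT
        return ("gsiftp", parts.hostname, port, parts.path)
    if parts.scheme == "file":
        # file:/path and file://localhost/path both refer to a local path.
        return ("file", parts.hostname or "localhost", None, parts.path)
    raise ValueError(f"unsupported scheme in {url!r}")


for url in (
    "gsiftp://dataserver.phys.uwm.edu/data/file1",
    "gsiftp://dataserver.phys.uwm.edu:15000/data/file1",
    "file:/home/skoranda/file1",
    "file://basil.phys.uwm.edu/home/skoranda/file1",
):
    print(describe_url(url))
```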


  • Globus-url-copy
    • To put a file onto the server, reverse the URLs

        globus-url-copy file:/home/skoranda/file1 gsiftp://dataserver.phys.uwm.edu/data/file1

    • By default 1 data channel is used
      • Average performance
      • Monitor performance using the -vb flag

        $ globus-url-copy -vb gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/smallfile file:/tmp/smallfile
        9437184 bytes 658.09 KB/sec avg 512.95 KB/sec inst


  • Going fast
    • Multiple channels dramatically boost the transfer rate

        $ globus-url-copy -vb -p 4 gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile file:/tmp/largefile
        523960320 bytes 5814.25 KB/sec avg 5568.27 KB/sec inst

    • Still faster by using large TCP windows

        $ globus-url-copy -vb -p 4 -tcp-bs 1048576 gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile file:/tmp/largefile
        514392064 bytes 6609.67 KB/sec avg 8639.71 KB/sec inst

    • Still faster by using large memory buffers

        $ globus-url-copy -vb -p 4 -bs 1048576 -tcp-bs 1048576 gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile file:/tmp/largefile
        523304960 bytes 7300.56 KB/sec avg 9311.99 KB/sec inst
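Why larger TCP windows help: a single TCP stream can have at most one window of unacknowledged data in flight per round trip, so its throughput is bounded by window size divided by round-trip time. The sketch below works through that bound. The 50 ms round-trip time and the 64 KiB "small" window are hypothetical values chosen for illustration, not measurements from these transfers.

```python
def max_throughput_bytes_per_s(window_bytes, rtt_seconds):
    """Upper bound on one TCP stream: one window per round trip."""
    return window_bytes / rtt_seconds


def window_needed_bytes(target_bytes_per_s, rtt_seconds):
    """Bandwidth-delay product: window needed to sustain the target rate."""
    return target_bytes_per_s * rtt_seconds


RTT = 0.05  # hypothetical 50 ms round trip between the two sites

small = max_throughput_bytes_per_s(64 * 1024, RTT)  # assumed small window
large = max_throughput_bytes_per_s(1048576, RTT)    # the -tcp-bs value used above

print(f"64 KiB window: {small / 1000:.0f} KB/s per stream at most")
print(f"1 MiB window:  {large / 1000:.0f} KB/s per stream at most")
print(f"window for 10 MB/s: {window_needed_bytes(10_000_000, RTT):.0f} bytes")
```

Parallel channels raise the same ceiling from the other direction: with `-p 4` each of the four streams gets its own window.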


  • Faster!
    • Depending on network & weather you can go very fast!

        $ globus-url-copy -vb -p 8 -bs 1048576 -tcp-bs 1048576 gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile file:/tmp/largefile
        185270272 bytes 18092.57 KB/sec avg 25153.96 KB/sec inst
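When these transfers are scripted, it helps to assemble the command line in one place. The sketch below builds an argument list using only the flags shown in this lecture (-vb, -p, -bs, -tcp-bs); the helper name and its parameters are hypothetical, and it only constructs the command rather than running it.

```python
import shlex


def build_copy_command(src, dst, parallel=1, buffer_bytes=None,
                       tcp_buffer_bytes=None, verbose=False):
    """Return an argv list for globus-url-copy with the chosen tuning flags."""
    argv = ["globus-url-copy"]
    if verbose:
        argv.append("-vb")
    if parallel > 1:
        argv += ["-p", str(parallel)]
    if buffer_bytes is not None:
        argv += ["-bs", str(buffer_bytes)]
    if tcp_buffer_bytes is not None:
        argv += ["-tcp-bs", str(tcp_buffer_bytes)]
    argv += [src, dst]
    return argv


cmd = build_copy_command(
    "gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile",
    "file:/tmp/largefile",
    parallel=8,
    buffer_bytes=1048576,
    tcp_buffer_bytes=1048576,
    verbose=True,
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute the transfer, which requires the Globus client tools and valid GSI credentials on the machine.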


  • Third-party transfers
    • Transfers from server to server directed by the client
    • Use gsiftp:// URLs for both
      • Requires that both servers be configured to allow third-party transfers

        $ hostname
        basil.phys.uwm.edu
        $ globus-url-copy gsiftp://hydra.phys.uwm.edu/tmp/file1 gsiftp://contra.phys.uwm.edu/tmp/file1
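Whether a copy is a download, an upload, or a third-party transfer follows directly from the two URL schemes. A small sketch that classifies a source and destination pair; the function name and the returned labels are hypothetical.

```python
from urllib.parse import urlsplit


def transfer_kind(src, dst):
    """Classify a copy by the schemes of its source and destination URLs."""
    src_scheme = urlsplit(src).scheme
    dst_scheme = urlsplit(dst).scheme
    if src_scheme == "gsiftp" and dst_scheme == "gsiftp":
        return "third-party"  # data flows server to server
    if src_scheme == "gsiftp" and dst_scheme == "file":
        return "get"          # remote server to local machine
    if src_scheme == "file" and dst_scheme == "gsiftp":
        return "put"          # local machine to remote server
    return "other"


print(transfer_kind(
    "gsiftp://hydra.phys.uwm.edu/tmp/file1",
    "gsiftp://contra.phys.uwm.edu/tmp/file1",
))
```

In the third-party case the client on basil.phys.uwm.edu only directs the transfer; the file contents move between hydra and contra.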


  • Debugging
    • Use -dbg to see control channel communication

        $ globus-url-copy -dbg gsiftp://hydra.phys.uwm.edu/tmp/file1 file:/tmp/file1
        debug: starting to get gsiftp://hydra.phys.uwm.edu/tmp/file1
        debug: connecting to gsiftp://hydra.phys.uwm.edu/tmp/file1
        debug: response from gsiftp://hydra.phys.uwm.edu/tmp/file1:
        220 hydra.phys.uwm.edu GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
        debug: authenticating with gsiftp://hydra.phys.uwm.edu/tmp/file1
        debug: response from gsiftp://hydra.phys.uwm.edu/tmp/file1:
        230 User skoranda logged in.
        debug: sending command:
        FEAT
        debug: response from gsiftp://hydra.phys.uwm.edu/tmp/file1:
        211-Extensions supported:
         REST STREAM
         ESTO
         ERET
         MDTM
         SIZE
         PARALLEL
         DCAU
        211 END
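The FEAT reply in the debug trace tells a client which extensions the server supports, for example whether PARALLEL is available before asking for multiple channels. A minimal sketch of parsing such a reply, assuming the standard multi-line FTP layout (a "211-" opening line, one feature per line, a "211 " closing line); the function name is hypothetical.

```python
def parse_feat(reply):
    """Return the list of features from a multi-line FTP FEAT reply."""
    lines = reply.splitlines()
    if not lines or not lines[0].startswith("211-"):
        raise ValueError("not a multi-line 211 FEAT reply")
    features = []
    for line in lines[1:]:
        if line.startswith("211 "):  # closing line of the reply
            break
        if line.strip():
            features.append(line.strip())
    return features


REPLY = """\
211-Extensions supported:
 REST STREAM
 ESTO
 ERET
 MDTM
 SIZE
 PARALLEL
 DCAU
211 END
"""

features = parse_feat(REPLY)
print(features)
print("parallel channels supported:", "PARALLEL" in features)
```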


  • Globus-url-copy
    • Actually a general-purpose URL copying tool
      • No GSI authentication used
      • Parallel channels and the like won't work

        $ globus-url-copy http://www.yahoo.com file:/tmp/yahoo

        $ globus-url-copy ftp://ftp.globus.org/banner.msg file:/tmp/banner.msg
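The idea of a general-purpose URL copier can be reproduced in a few lines with Python's standard library, which handles http://, ftp:// and file:// sources. This is an analogy for illustration only, not how globus-url-copy is implemented, and like the commands above it involves no GSI authentication or parallel channels. The function name is hypothetical.

```python
import shutil
from urllib.request import urlopen


def copy_url_to_file(src_url, dst_path):
    """Stream the contents of src_url into a local file and return its path."""
    with urlopen(src_url) as response, open(dst_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dst_path


if __name__ == "__main__":
    import pathlib
    import tempfile

    # Demonstrate with a file:// URL so the example needs no network access.
    with tempfile.TemporaryDirectory() as tmp:
        source = pathlib.Path(tmp, "banner.msg")
        source.write_bytes(b"hello grid\n")
        target = copy_url_to_file(source.as_uri(), str(pathlib.Path(tmp, "copy.msg")))
        print(pathlib.Path(target).read_bytes())
```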


  • GridFTP clients
    • UberFTP
      • Developed and supported at the National Center for Supercomputing Applications (NCSA)
      • Interactive, like our old (insecure) friend ftp
      • Use -a GSI for GSI authentication
      • Supports multiple channels using the -c flag

        $ uberftp -H hydra.phys.uwm.edu -a GSI
        220 hydra.phys.uwm.edu GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
        230 User skoranda logged in.
        uberftp>


  • GridFTP clients
    • Roll your own
      • Add functionality directly to your applications
        • Could your application find and download its own data?
        • Could your application deliver output data files when finished computing?
      • Globus Toolkit offers APIs to code against
        • C
        • Java
        • Python


  • GridFTP and Firewalls
    • Nice document b