47
Grid Computing Introduction Tuesday Morning Session Brian Bockelman, [email protected] OSG Staff University of Nebraska-Lincoln

Grid Computing Introduction Tuesday Morning Session Brian Bockelman, [email protected] OSG Staff University of Nebraska-Lincoln

Embed Size (px)

Citation preview

Page 1: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

Grid Computing IntroductionTuesday Morning Session

Brian Bockelman, [email protected]

OSG Staff

University of Nebraska-Lincoln

Page 2: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

Part I: Introduction to Grids

Page 3: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Outline

• Grid Computing, definitions and implementation.

• How security and handling works on the OSG.

• The OSG approach to grid computing.

3

Page 4: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Grid Computing

• Per usual, wikipedia offers a decent starting point: Grid computing is the combination of

computer resources from multiple administrative domains for a common goal.

• Grid computing is used to perform computations which may not be feasible otherwise. Reasons may be: Practical (one site can’t hold all the computers) Opportunistic (an organization wants to take

advantage of more computing resources) Political (multiple big sites working together).

4

Page 5: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Grid Computing

• Important aspects of the definition: “Combination of computing resources”:

Implies each resource can function separately. Overall, extra layer of difficulty to handle

compared to using a single resource (even if this is hidden from the end user!).

“Multiple administrative domains”: There must be some level of trust between the

user and sites. These trust relationships can be very complex!

5

Page 6: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Original Idea

• The original idea behind grid computing was to make computing power as easy to access as the electrical grid. You could take your job and plug it in to

“the grid”. Everyone can use the same interface. This metaphor also implied grids would be

as easy to use as the power system… (… which might have been a pipe dream)

6

Page 7: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

What makes it unique?

• Food for thought: What’s the difference between grid

computing and cloud computing? What’s the difference between grid

computing and the capabilities Condor provides?

7

Page 8: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Grids in the US

• The two largest grids in the US are the Teragrid and the OSG. Both are formed by taking traditional

computing sites and allowing users to access resources in a somewhat uniform manner.

Resources include: Unique supercomputers: IBM Blue Genes Linux clusters: Loosely-coupled Intel/Linux Data archives: Long-term tape storage Large-scale data systems: Distributed or clustered

file systems, providing hundreds of TB to multiple petabytes.

8

Page 9: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Teragrid, in a nutshell

• The Teragrid is formed by a small number (less than 10) of computing centers. These are some of the largest computing

centers in the world. Often multiple, unique resources per center.

Clearly favors a few incredibly powerful resources. By invite only.

Compute time is allocated by committee. Access via both grid protocols and ssh logins.

9

Page 10: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

The Open Science Grid

• The OSG is a grid formed by 80 sites across the US and the world. Most sites are small-to-medium Linux clusters,

with a few large clusters. Primary stakeholders are the LHC and LIGO.

Focus is on data-intensive, high-throughput processing (not “traditional supercomputing”).

Compute time is allocated by individual site policy. Strong emphasis on decentralization.

No SSH logins allowed to remote sites.

10

https://twiki.grid.iu.edu/bin/view/ReleaseDocumentation/WhatIsOSG

Page 11: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Map of OSG sites

11

To give you a feel for the distribution of the OSG sites…

Page 12: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Today’s Sites

• We’ll be using 3 different sites today. OSG-EDU: A small, educational resource here

at Wisconsin. <100 cores managed by Condor, only a few GB of NFS-based storage.

Nebraska: A medium-sized production site owned by CMS in Lincoln, NE. 1500 cores managed by Condor, about 350TB of Hadoop storage.

Firefly: A large production site owned by HCC in Omaha, NE. ~5000 cores managed by PBS, 150TB of Panasas storage.

12

Page 13: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

OSG Compute Resources

• The OSG CE is layered on top of a traditional batch system.

13

Site Cluster

OSG CE OSG CE

Batch System services

WN

WN

WN

WN

WN

WN

User

Page 14: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Inside the OSG CE

14

• The current core of the OSG CE is the Globus Toolkit. Important parts are: Globus GRAM: Translates jobs in RSL format to batch system jobs; allows

generic batch system commands to be translated to the site. GASS server: Used to stage small files in and out of the GridFTP server: Used to move large files in and out of the host.

Page 15: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Anatomy of a Grid Job

• All job creations will have the following steps: User creates job and determines which OSG CE it

will be sent to. User submits job description and files to Globus

on the OSG CE. Globus converts the job to a batch system job,

and submits that. Job starts running on the worker node.

• Job finishing goes up in the reverse direction.• For the user to know the job status, there are 3

different systems (user, Globus, batch system) that must be in sync.

15

Page 16: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

OSG CE, In Summary

• The core of the CE is the Globus middleware (using the GT2 protocol). Globus allows an abstract interface to your system

to be exported to the world. Provides the means to allow users to submit to

multiple administrative domains.

• The CE interacts with your cluster’s batch system; grid jobs are converted to and run as batch system jobs. The component doing the conversion is called the

“Job Manager”. Condor sites use “jobmanager-condor”, PBS sites use “jobmanager-pbs”, etc.

16

Page 17: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

And everything else

• There are quite a few more OSG components: Monitoring Accounting Information services Storage and transfer

• Which I will not be covering in this talk.• You’ll learn them throughout the week

(assuming you don’t skip class).

17

Page 18: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Review

• Grid computing allows one to utilize multiple computing resources from multiple administrative domains. This is more complex than traditional batch

systems.

• OSG is one implementation of a grid; its technology is based upon the Globus Toolkit and Condor. It has almost 100 computing resources

(clusters) and 50 storage resources.

18

Page 19: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

Part II: Trust Relationships

Page 20: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Trust Relationships

• What kind of trust relationships do we encounter in the airport? Passports Tickets “Secured area” inside terminal

20

Page 21: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Trust Relationships in the Grid

• In the grid, we usually think of the users/organizations as the consumers and the sites as the producers. How does the site know you are who you

say you are? How does the site know you are allowed to

submit jobs?

21

Page 22: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Compare to the Cloud

• What kind of trust relationship is there in the Amazon EC2 cloud? You trust a SSL connection with the Amazon

SSL certificate has Amazon on the other side. Amazon allows you to use the compute

resources if your credit card number is valid. You trust Amazon gives you a certain amount

of computing for your money. What else?

• Note the trust is 2 way!

22

Page 23: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

The one you forgot about!

• How do you trust the site? After all, this is your data! How do you know they

aren’t going to steal your Ph.D. thesis? Your new novel protein? (How is this different from any case of using computing

you don’t own?) The answer is that you trust the organization running the

resource. Often, you trust the OSG to only allow reputable

organizations to join. In this case, the trust relationship is based on

society, not technology. Keep this in mind – the societal aspects are equally as

important as the technology sometimes. Read “Reflections on Trusting Trust”!

23

Page 24: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Authn and Authz

• In order to establish trust relationships on the grid, two things need to happen: Authentication (authn): The process of

establishing an identity for your job. Authorization (authz): Determining that

your job is allowed to run at the site.

• Think: What authentication and authorization need to happen at an airport?

24

Page 25: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

X509 and GSI Security

• In the OSG, authentication happens using a grid certificate. This is simply a personal SSL certificate. The grid certificate is signed by a trusted

authority – the Certificate Authority – and vouches for your identity.

When you need to temporary delegate your rights to elsewhere – like to a remote job – you can use your certificate to form a proxy certificate. This authenticates a grid job as belonging to you.

Any grid job with your proxy you get the blame for.

25Your grid certificate identifies who you are!

Page 26: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Authorization on the OSG

• It would be very hard for sites to authorize each user independently (think: CMS has around 2000).

• Instead, each site authorizes the organizations they want to partner with to run at their site. And the organization securely informs the site

who is in their organization. Because these organizations don’t always

deal with a physical entity (like a single campus or lab), they’re referred to as a “virtual organization” or a VO.

26

Page 27: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Take Home Message

• You are identified by your certificate. This is your authentication.

• You must join a VO to use the OSG.• A site makes authorization decisions

based upon a VO.• With this model, we tend to minimize the

number of communications between the site, user, and organization.

• The OSG implements authorization and authentication on top of x509 certificates and PKI.

27

Page 28: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

Segway: Hands-on with security

Page 29: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Hands-On with Security

• Bounce on over to the following URL: https://twiki.grid.iu.edu/bin/view/Education/

GlobusToolsOss2010

29

Page 30: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Hands-On with Security

• You should have learned… That you have a personal certificate.

You know your DN and CN.

That you are a member with of an OSG VO (osgedu, at least).

How to create a certificate. The difference between a plain “grid” certificate

and a “voms” certificate.

How to test Globus authentication. And a few common error messages!

30

Page 31: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

Segway: Globus Tools

Page 32: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Hands-On with Globus Tools

• You should have… Run a single executable and a script. Explored the difference between the

different jobmanagers. Thought about what it would take to

manage HTC-type grid jobs.

32

Page 33: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

Part III: Condor-G

Page 34: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Condor-G

• Ok, we know what the grid is… … and how to authenticate … … how the heck do we use it effectively?

• There is an added complexity in using multiple sites, but many things remain the same: Job submission Checking job status Cancelling jobs Managing input/output Managing job dependencies

34

Page 35: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Observation

• The requirements for the grid and for the batch system look about the same!

• Key insight: We can reuse a large portion of our batch system to command and control grid jobs. In fact, this means we can present familiar

interfaces to the user!

• Hence, Condor-G was born.

35

Page 36: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Normal Condor

36

Page 37: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Condor-G

37

Page 38: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Condor vanilla vs Condor-G

• In Condor, the current state of the batch slot is represented by the shadow. Negotiation provides matchmaking down to

the best slot. Schedd provides queue management and

presents a “batch system” interface to the user. Works with shadow and negotiator.

• In the grid universe, a “gridmanager” process is spawned which performs the different queue actions.

38

Page 39: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Gridmanagers

• There is one gridmanager type per grid flavor: Globus Nordugrid EC2 (Clouds) PBS.

• PBS isn’t normally thought of as a grid, but Condor-G / the “grid universe” is just a way to interface the Condor with external batch-like systems.

39

Page 40: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Gridmanagers

• For example, the GT2 grid manager will take a job from the schedd and use the job’s ClassAd to submit the job via a Globus C library. If the job is successfully submitted, the

gridmanager process will update the Schedd accordingly.

• Periodically, the gridmanager will poll all the jobs at a site, get their latest status, and update the schedd queue.

40

Page 41: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Condor vs Condor-G vs Globus

Action Condor Condor-G Globus

Submit job condor_submit condor_submit globus-job-submit

Query Status condor_status condor_status globus-job-status

Cancel job condor_rm condor_rm globus-job-cancel

Job Description ClassAd ClassAd RSL

41

• All your queue and job management can be done by condor.• Familiar interface and description language.

• Tools which know how to interact with Condor can interact cleanly with grid jobs.• There may be no “standard grid protocol”, but Condor-G is

almost the defacto standard grid client.• Allows effective use even while the server technology is

evolving (Keep up with the hype cycle: Utility computing -> grid computing -> cloud computing; all can use Condor)

Page 42: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Workflow on the Grid

• See: Workflows on Condor, i.e. DAGMan. Because DAGMan layers on top of the

Condor schedd… And we put our grid jobs into Condor-G… All of our workflow methodology converts

to the grid with no changes.

42

Page 43: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Condor-G Details

• Always use Condor-G on the OSG; Globus tools are extremely unscalable and will crash the remote site at about 100 jobs.

• Condor-G is activated when you set the Condor job’s universe to “grid”

• You also need to specify what endpoint and grid flavor using “grid_resource” in your submit file.

43

Page 44: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Condor-G Details

• To use Condor-G to submit to a Globus resource, add the following 2 lines to your submit file: universe=grid grid_resource = gt2

osg-edu.cs.wisc.edu:/jobmanager-condor

• This submits to the OSG-EDU resource. A few other sample endpoints: ff-grid.unl.edu:/jobmanager-pbs (Firefly) red.unl.edu:/jobmanager-condor (Nebraska)

44

Page 45: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Condor-G “Gotchas”

• Assume no shared file system. Know your input and output files and let Condor handle

the file movement.• Never run compute jobs on “jobmanager” or

“jobmanager-fork” (anyone recall why?)• Grid is NOT uniform.

Different architectures, OS, execution environment. Different policies, restrictions, site preferences. Condor-G does not make any of these magically

heterogeneous. On the OSG, for Condor-G, you are expected to

provide a wrapper to make the execution environment what you need.

45

Page 46: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Review

• You should have learned: Grids have needs similar to batch systems. Condor schedd provides all the job/queue

management that one needs for the grid. Condor-G is activated when you use the grid

universe in Condor. The normal Condor components are replaced by

the “gridmanager” process, which translates the Condor commands to grid commands.

By using Condor-G, you can reuse the (large) Condor infrastructure on many grids, including the OSG.

46

Page 47: Grid Computing Introduction Tuesday Morning Session Brian Bockelman, bbockelm@cse.unl.edu OSG Staff University of Nebraska-Lincoln

OSG Summer School 2010

Questions?

• Questions? Comments?• Feel free to ask me questions later:

Brian Bockelman, [email protected]

• Upcoming sessions: Afternoon, after lunch:

Handling large-scale data. Room 2310

47