
Alexandru V Staicu (1), Jacek R. Radzikowski (1), Kris Gaj (1), Nikitas Alexandridis (2), Tarek El-Ghazawi (2)

(1) George Mason University, (2) George Washington University

Effective Use of Networked Reconfigurable Resources

http://ece.gmu.edu/lucite


Problem:

• Reconfigurable resources expensive and underutilized

• Many of these resources available over the network

• It is desirable to leverage networked reconfigurable resources to help other users within the same organization


Approach: Adapt and use a Job Management System

[Figure: a Submission Host passes Tasks 1, 2, 3 to a Master Host, which distributes Task 1, Task 2, and Task 3 among Execution Hosts 1, 2, and 3.]


Approach:

• Select the most suitable existing Job Management System (JMS)
- identify and define functional requirements
- rank known systems according to these requirements
- identify which JMS is the easiest to extend

• Extend this JMS to recognize and utilize reconfigurable resources
- add new dynamic resources
- configure scheduling to be based on these new resources


Networked Reconfigurable Resource Management System

[Figure: a Submission Host, a Master Host, and Execution Hosts 1, 2, and 3 handling Tasks 1, 2, 3 (Task 1, Task 2, Task 3), with FPGA boards.]


Heterogeneous network with FPGA-based accelerators

[Figure: network diagram. Labels shown: Myrinet SAN/LAN switch; two Ethernet Intelligent Hubs, 100 Mbps; hosts labeled Dell, HP, Gateway, Sparc 10, and Sparc 20; FPGA boards labeled WILDFORCE, WILDSTAR, and SLAAC; SLAAC Research Reference Platform.]


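Functional units of a typical Job Management System

[Figure: block diagram of a JMS. The User Server accepts jobs and their requirements. The Job Scheduler works from resource requirements, scheduling policies, and the available resources reported by the Resource Monitor. The Job Dispatcher handles resource allocation and job execution. A Resource Manager block is shown alongside the Resource Monitor and Job Dispatcher.]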
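The interaction between these functional units can be illustrated with a small sketch. It is not taken from any of the surveyed systems: the class names mirror the labels on the slide, while the "first host that fits" policy, the resource names, and the `run_jms` helper are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    requirements: dict  # resource name -> amount needed (hypothetical names)


@dataclass
class Host:
    name: str
    resources: dict  # resource name -> amount currently available


class ResourceMonitor:
    """Reports which resources are currently available on each host."""

    def __init__(self, hosts):
        self.hosts = hosts

    def available(self):
        return {h.name: dict(h.resources) for h in self.hosts}


class JobScheduler:
    """Matches a job to a host. Assumed policy: first host that fits."""

    def pick_host(self, job, available):
        for host_name, resources in available.items():
            if all(resources.get(r, 0) >= n for r, n in job.requirements.items()):
                return host_name
        return None


class JobDispatcher:
    """Allocates resources on the chosen host and records the placement."""

    def dispatch(self, job, host):
        for r, n in job.requirements.items():
            host.resources[r] = host.resources.get(r, 0) - n
        return (job.name, host.name)


def run_jms(jobs, hosts):
    """Plays the role of the user server: takes jobs, returns placements."""
    monitor = ResourceMonitor(hosts)
    scheduler = JobScheduler()
    dispatcher = JobDispatcher()
    by_name = {h.name: h for h in hosts}
    placements, pending = [], []
    for job in jobs:
        target = scheduler.pick_host(job, monitor.available())
        if target is None:
            pending.append(job.name)  # no host currently satisfies the job
        else:
            placements.append(dispatcher.dispatch(job, by_name[target]))
    return placements, pending
```

A job that asks for an `fpga` resource is placed only on a host that reports one as free, which is the behavior the extension to reconfigurable hardware aims for.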


Classification of Investigated Systems (1)

Centralized JMS
• LSF
• CODINE
• PBS
• Condor
• RES

Distributed JMS w/o a Central Scheduler
• Globus
• Legion
• NetSolve

Distributed Operating System
• MOSIX


Classification of Investigated Systems (2)

Parameter Study Scheduler
• AppLeS

Resource Monitor and Forecaster
• NWS

Distributed Computing Interface
• Compaq DCE


Operating system, flexibility, user interface

Systems compared: LSF, Codine, PBS, CONDOR, RES

Distribution: LSF com, Codine pub, PBS pub/com, CONDOR pub, RES gov

Source code

OS Support: Solaris, Linux, Tru64, NT

User Interface: GUI & CLI (four systems), CLI (one system)


Scheduling and Resource Management

LSF Codine PBS CONDOR RES

Batch jobs

Interactive jobs

Parallel jobs

Accounting


Efficiency and Utilization

LSF Codine PBS CONDOR RES

Stage-in and stage-out

Timesharing

Process migration

Dynamic load balancing

Scalability


Fault Tolerance and Security

LSF Codine PBS CONDOR RES

Checkpointing

Daemon fault recovery

Authentication

Authorization


Documentation and Technical Support

LSF Codine PBS CONDOR RES

Documentation

Technical support


JMS features supporting extension to reconfigurable hardware

• capability to define new dynamic resources

• strong support for stage-in and stage-out
- configuration bitstreams
- executable code
- input/output data

• support for Windows NT and Linux
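Stage-in and stage-out, as listed above, amount to copying the bitstream, executable, and input data to the execution host before the job runs and copying results back afterwards. The following is a minimal sketch of that idea on a local file system. It does not reproduce the staging mechanism of any particular JMS; all function names and the `run` callback are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def stage_in(files, work_dir):
    """Copy the job's files into the execution host's working directory."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy(f, work_dir)) for f in files]


def stage_out(names, work_dir, dest_dir):
    """Copy the named result files from the working directory to dest_dir."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy(Path(work_dir) / n, dest_dir)) for n in names]


def run_staged_job(bitstream, executable, input_data, dest_dir, run):
    """Stage in, call run(work_dir), then stage out the files run() names.

    `run` is a hypothetical callback standing in for configuring the board
    and executing the job; it returns the names of the output files.
    """
    with tempfile.TemporaryDirectory() as work_dir:
        stage_in([bitstream, executable, input_data], work_dir)
        output_names = run(Path(work_dir))
        return stage_out(output_names, work_dir, dest_dir)
```

The temporary working directory is removed once the outputs have been copied out, so nothing is left behind on the execution host.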


Ranking of Centralized Job Management Systems (1)

Capability to define new dynamic resources:

Excellent: LSF, PBS, CODINE
More difficult: CONDOR, RES

Stage-in and stage-out:

Excellent: LSF, PBS
Limited: CONDOR
No: CODINE, RES


Ranking of Centralized Job Management Systems (2)

Overall suitability to extend to reconfigurable hardware:

1. LSF
2. CODINE
3. PBS
(1 to 3: without changing the JMS source code)

4. CONDOR
5. RES
(4 and 5: requires changes to the JMS source code)


Extension of LSF to reconfigurable hardware (1): Operation of LSF

[Figure: operation of LSF, with steps numbered 1 to 13.
Submission host: bsub app, LIM, Batch API.
Master host: MLIM, MBD, queue; load information arrives from other hosts.
Execution host: SBD, Child SBD, LIM, RES, User job.]

LIM – Load Information Manager
MLIM – Master LIM
MBD – Master Batch Daemon
SBD – Slave Batch Daemon
RES – Remote Execution Server


Extension of LSF to reconfigurable hardware (2)

[Figure: operation of the extended LSF, with steps numbered 1 to 14.
Submission host: bsub app, LIM, Batch API.
Master host: MLIM, MBD, queue; load information arrives from other hosts.
Execution host: SBD, Child SBD, LIM, RES, User job, plus ELIM, ACS API, and an FPGA board. The status of the board is obtained through the ACS API and ELIM.]

ELIM – External Load Information Manager
ACS API – Adaptive Computing Systems API
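The role of the ELIM in this diagram can be sketched as a small program that periodically reports the board status to the LIM. The sketch rests on several assumptions. The report format (a count followed by name/value pairs, one line per report on standard output) reflects the commonly documented LSF ELIM convention and should be checked against the LSF version in use. `query_board_status` is a hypothetical stub standing in for a real ACS API call. The resource names `fpga_boards_free` and `fpga_configured` are invented for illustration.

```python
import sys
import time


def query_board_status():
    """Hypothetical stand-in for an ACS API query; returns fixed values."""
    return {"fpga_boards_free": 1, "fpga_configured": 0}


def format_elim_report(status):
    """Build one report line: '<count> <name1> <value1> <name2> <value2> ...'."""
    fields = [str(len(status))]
    for name, value in status.items():
        fields += [name, str(value)]
    return " ".join(fields)


def elim_loop(interval_s=15.0, cycles=None, out=sys.stdout):
    """Write a status report every interval_s seconds.

    Runs forever when cycles is None; otherwise stops after `cycles` reports.
    """
    done = 0
    while cycles is None or done < cycles:
        out.write(format_elim_report(query_board_status()) + "\n")
        out.flush()  # make each report visible to the reader immediately
        done += 1
        if cycles is None or done < cycles:
            time.sleep(interval_s)
```

Keeping the board query separate from the report formatting means only `query_board_status` would need to change to talk to real hardware.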


Conclusions (1)

• 12 systems evaluated using 25 functional requirements plus their suitability for extension to support reconfigurable hardware

• LSF, CODINE, PBS, and Condor ranked the highest on the functional requirements

• LSF, CODINE, and PBSPro found easy to extend without changes to their source code

• LSF most suitable to support reconfigurable hardware


Conclusions (2)

• General software architecture of the extended system developed

• Experimental development, verification, and performance evaluation of the extended system in progress