
Troxel #197 MAPLD 2004

CARMA: A Comprehensive Management Framework for High-Performance Reconfigurable Computing

Ian A. Troxel, Aju M. Jacob, Alan D. George,

Raj Subramaniyan, and Matthew A. Radlinski

High-performance Computing and Simulation (HCS) Research Laboratory

Department of Electrical and Computer Engineering

University of Florida

Gainesville, FL


CARMA Motivation

Key missing pieces in RC for HPC:
- Dynamic RC fabric discovery and management
- Coherent multitasking, multi-user environment
- Robust job scheduling and management
- Design for fault tolerance and scalability
- Heterogeneous system support
- Device-independent programming model
- Debug and system health monitoring
- System performance monitoring into the RC fabric
- Increased RC device and system usability

Our proposed Comprehensive Approach to Reconfigurable Management Architecture (CARMA) attempts to unify existing technologies as well as fill in the missing pieces.

[Figure: CARMA artwork (Holy Fire by Alex Grey)]


CARMA Framework Overview

CARMA seeks to integrate:
- Graphical user interface
- Flexible programming model
- COTS application mapper(s)
  - Handel-C, Impulse-C, Viva, System Generator, etc.
- Graph-based job description
  - DAGMan, Condensed Graphs, etc.
- Robust management tool
  - Distributed, scalable job scheduling
  - Checkpointing, rollback, and recovery
  - Distributed configuration management
- Multilevel monitoring service (GEMS)
  - Networks, hosts, and boards
  - Monitoring down into the RC fabric
- Device-independent middleware API
- Multiple types of RC boards
  - PCI (many), network-attached, Pilchard
- Multiple high-speed networks
  - SCI, Myrinet, GigE, InfiniBand, etc.

[Figure: CARMA node architecture. The user interface, algorithm mapping, RC cluster management, performance monitoring, and middleware API run on the COTS processor; the RC fabric API sits between the middleware and the RC fabric; control and data networks connect the RC node to other nodes.]


Application Mapper Evaluation

Evaluating on the basis of ease of use, performance, hardware device independence, programming model, parallelization support, resource targeting, network support, stand-alone mapping, etc.

- C-based tools
  - Celoxica SDK (Handel-C)
    - Provides access to in-house boards: ADM-XRC (x1), Tarari (x4), RC1000 (x4)
    - Good deal of success after lessons learned
    - Focused on hardware design
  - Impulse Accelerated Technologies Impulse-C
    - Provides an option for hardware independence
    - Built upon the open-source Streams-C from LANL
    - Supports ANSI-standard C
- Graphical tools
  - StarBridge Systems Viva
  - Nallatech Fuse / DIMEtalk
  - Annapolis Micro Systems CoreFire
- Xilinx ISE compulsory
  - Evaluating the role of JBits, System Generator, and XHWIF
- Evaluations still ongoing; the programming model is a fundamental issue to be addressed

[Figure: Streams-C toolflow, c/o LANL]


CARMA Interface

- Simple graphical user interface
  - Preliminary basis for the GUI via the Simple Web Interface Link Library (SWILL) from the University of Chicago*
  - User view for authentication and job submission/status
  - Administration view for system status and maintenance
- Applications supported
  - Single or multiple tasks per job (via CARMA DAGs**)
  - CARMA-registered (via CARMA API and DAGs) or not
    - Registration provides security and fault tolerance
  - Sequential and parallel (hand-coded or via MPI)
- C-based application mappers supported
  - CARMA middleware API provides architecture independence
  - Any code that can link to the CARMA API library can be executed (Handel-C and ADM-XRC API tested to date); a sketch of such a task follows the footnotes below
  - Bit files must be registered with the CARMA Configuration Manager (CM)
  - All other mappers can use "not CARMA registered" mode
  - Plans for linking Streams/Impulse-C, System Generator, et al.

* http://systems.cs.uchicago.edu/swill/

** Similar to Condor DAGs
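The slides describe the CARMA API only at this level, so the following is a minimal sketch of what a CARMA-registered task might look like. The carma.h header and every carma_* name are hypothetical stand-ins, not the actual library.

```c
/* Hypothetical sketch: a CARMA-registered task linked against the CARMA
 * middleware API. Every carma_* identifier below is invented for
 * illustration; the real API is not shown in the slides. */
#include <stdio.h>
#include <stdlib.h>
#include "carma.h"                       /* hypothetical CARMA API header */

int main(int argc, char **argv)
{
    carma_ctx *ctx = carma_init(argc, argv);   /* register with the local JM */
    if (!ctx) {
        fprintf(stderr, "CARMA registration failed\n");
        return EXIT_FAILURE;
    }

    /* Ask the middleware for any board whose registered bit file implements
     * the "AddOne" function; CARMA chooses the device, not the user. */
    carma_board *brd = carma_request_board(ctx, "AddOne.bit");

    int in = 41, out = 0;
    carma_write(brd, &in, sizeof in);          /* operand to the RC fabric */
    carma_read(brd, &out, sizeof out);         /* result back from fabric  */
    printf("AddOne(%d) = %d\n", in, out);

    carma_release_board(ctx, brd);        /* free board for reconfiguration */
    carma_finalize(ctx);                  /* report completion to the JM    */
    return EXIT_SUCCESS;
}
```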


CARMA User Interface


CARMA Job Manager (JM)

- Prototyping effort (CARMA interoperability)
  - Completed first version of the CARMA JM
  - Task-based execution via Condor-like DAGs (see the sketch after the figure below)
  - Separate processes and message queues for fault tolerance
  - Checkpointing enabled, with rollback in progress
  - Links to all other CARMA components
  - Fully distributed multi-node operation with job/task migration
  - Links to the CARMA monitor and GEMS to make scheduling decisions
  - Tradeoff studies and analyses underway
- External extensions to COTS tools (COTS plug and play)
  - Expands upon preliminary work at GWU/GMU*
  - Striving for a "plug and play" approach to the JM
  - CARMA Monitor provides board information (via ELIM)
  - Working to link to the CARMA CM
  - Tradeoff studies and analyses underway
  - Integration of other CARMA components in progress

* Kris Gaj, Tarek El-Ghazawi, et al., "Effective Utilization and Reconfiguration of Distributed Hardware Resources Using Job Management Systems," Reconfigurable Architectures Workshop 2003, Nice, France, April 2003. (Figure c/o GWU/GMU.)

[Figure: CARMA DAG example. Five tasks (Hyper.1 through Hyper.5) are connected by numbered input/output ports; the graph reads File1 and File2 and writes results to stdout.]
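The CARMA DAG syntax itself is not shown; since the slides describe it as similar to Condor DAGs, a DAGMan-flavored sketch of a graph like the one pictured might read as follows. The task names, submit files, and edges are illustrative, not taken from the actual example.

```
# Hypothetical, DAGMan-style description of a five-task CARMA DAG.
# All names and dependencies are invented for illustration; the real
# CARMA DAG format is not shown in the slides.
JOB Hyper.1 hyper1.task    # reads File1
JOB Hyper.2 hyper2.task    # reads File2
JOB Hyper.3 hyper3.task
JOB Hyper.4 hyper4.task
JOB Hyper.5 hyper5.task    # writes to stdout

PARENT Hyper.1 Hyper.2 CHILD Hyper.3
PARENT Hyper.3 CHILD Hyper.4
PARENT Hyper.4 CHILD Hyper.5
```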


CARMA CM Design

Builds upon previous design concepts*
- Execution Manager (EM)
  - Forks tasks from the JM and returns results to the JM
  - Requests and releases configurations
- Configuration Manager (CM)
  - Manages configuration transport and caching
  - Loads and unloads configurations via the BIM
- Board Interface Module (BIM)
  - Provides board independence
  - Allows for configuration temporal-locality benefits
- Communication Module
  - Handles all inter-node communication

[Figure: CM architecture. On the local node, the Execution Manager and Configuration Manager use inter-process communication, while the Communication Module handles file transfers and control-network traffic to remote nodes. The CM spawns a BIM for each board; each BIM configures and drives its RC board through the board API, translating the CARMA Board Interface Language into board-specific communication.]

Board Interface Module (BIM)
- Configures and interfaces with a diverse set of RC boards
  - Numerous PCI-based boards
  - Various interfaces for network-attached RC
- Instantiated at startup
- Provides hardware independence to higher layers (see the interface sketch below)
- Separate BIM for each supported board
- Simple standard interface to boards for remote nodes
- Enhances security by authenticating data and configurations

* U. of Glasgow (Rage), Imperial College in the UK, U. Washington, among others
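The slides describe the BIM's role but not its code. Below is a minimal C sketch under the assumption that each board type supplies a table of function pointers, which is one common way to get the hardware independence described above; none of these identifiers come from the actual CARMA source.

```c
/* Hypothetical sketch of a board-independent BIM interface. Each supported
 * board supplies one bim_ops table, so the CM can drive any board through
 * the same calls; all names here are invented for illustration. */
#include <stddef.h>

typedef struct bim_ops {
    int  (*open)(const char *device);                /* attach to the board */
    int  (*configure)(int h, const char *bitfile);   /* load a bitstream    */
    int  (*write)(int h, const void *buf, size_t n); /* host to fabric      */
    int  (*read)(int h, void *buf, size_t n);        /* fabric to host      */
    void (*close)(int h);                            /* detach and unload   */
} bim_ops;

/* One table per supported board type, e.g.: */
extern const bim_ops rc1000_ops;   /* Celoxica RC1000    */
extern const bim_ops admxrc_ops;   /* Alpha Data ADM-XRC */
extern const bim_ops tarari_ops;   /* Tarari CPX2100     */

/* The CM spawns one BIM per board and dispatches through the table,
 * never calling a vendor API directly. */
int bim_run(const bim_ops *ops, const char *dev, const char *bitfile,
            const void *in, void *out, size_t n)
{
    int h = ops->open(dev);
    if (h < 0 || ops->configure(h, bitfile) != 0)
        return -1;                                   /* board unavailable */
    ops->write(h, in, n);
    ops->read(h, out, n);
    ops->close(h);
    return 0;
}
```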


Distributed CM Management Schemes

Four schemes were evaluated: Master-Worker (MW), Client-Server (CS), Client-Broker (CB), and Simple Peer-to-Peer (SPP). (A sketch of the corresponding request messages follows the note below.)

[Figure: block diagrams of the four schemes.
- MW: jobs are submitted "centrally"; a global JM and resource manager (GJM/GRMAN) hold a global view of the system at all times, pushing tasks and states to local resource monitors (LRMON) and collecting results and statistics.
- CS: jobs are submitted locally by each node's local JM and resource manager (LJM/LRMAN); a server houses the configurations, and the GRMAN keeps a global view, returning tasks and configurations in response to requests and statistics.
- CB: jobs are submitted locally; the server brokers configurations, returning configuration pointers while the configurations themselves move directly between nodes.
- SPP: jobs are submitted locally; peers exchange configurations and requests directly, with no global view.]

Note: More in-depth results for distributed CM appeared at ERSA’04
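The slides show these flows as diagrams only; purely as an assumption about what such a protocol might carry, here is a C sketch of the message types (every field is invented, not the actual wire format).

```c
/* Hypothetical message types for a distributed CM; the slides show the
 * flows (requests, statistics, configurations, pointers) but no wire
 * format, so every name and field here is an assumption. */
#include <stdint.h>

typedef enum {
    CM_REQ_CONFIG,      /* request a bitstream: a CS server returns the data,
                           a CB server returns a pointer to a caching peer  */
    CM_CONFIG_DATA,     /* bitstream payload follows this header            */
    CM_CONFIG_POINTER,  /* CB reply: which node caches the bitstream        */
    CM_STATISTICS       /* periodic load/usage report toward the GRMAN      */
} cm_msg_type;

typedef struct {
    cm_msg_type type;
    uint32_t    src_node;     /* requesting node                         */
    uint32_t    config_id;    /* identifier of a registered bit file     */
    uint32_t    holder_node;  /* valid only in CM_CONFIG_POINTER replies */
    uint32_t    payload_len;  /* bytes of bitstream data that follow     */
} cm_msg;
```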


CM System Recommendations

| System Constraint | < 8 nodes | 8 to 32 | 32 to 512 | 512 to 1024 | 1024 to 4096 |
|---|---|---|---|---|---|
| Latency bound | Flat CS | CS over CS, group size 4 | SPP over CS, group size 4 | SPP over SPP, group size 8 | SPP over SPP, group size 16 |
| Bandwidth bound* | Flat CS | CS over CS, group size 4 | CS over CS, group size 8 | SPP over CS, group size 8 | SPP over CS, group size 8 |
| Best overall | Flat CS | CS over CS, group size 4 | SPP over CS, group size 4 | SPP over CS, group size 8 | SPP over CS, group size 8 |

Conclusions:
- The CARMA CM design imposes very little overhead on the system
- A hierarchical scheme is needed to scale to systems of thousands of nodes (traditional MW will not work)
- Multiple servers for the CS scheme do not reduce the server bottleneck for system sizes greater than 32
- SPP over CS (group size 8) gives the best overall performance for systems larger than 512 nodes

* Schemes with completion-latency values greater than 5 seconds excluded

Scalability projected up to 4096 nodes:
- Performed an analytic scalability analysis based on 16-node experimental results
  - Dual 2.4 GHz Xeons and a Tarari CPX2100 HPC board in a 64/66 PCI slot
  - Gigabit Ethernet and 5.3 Gbps Scalable Coherent Interface (SCI) as the control and data networks, respectively
- A flat system of 4096 nodes has very high completion times (~5 minutes for SPP and ~83 hrs for CS)
- A layered hierarchy is needed for reasonable completion times (~2.5 sec for SPP over SPP at 4096 nodes); a toy model below illustrates why
- CS reduces network traffic by sacrificing response time, while SPP improves response time by increasing network utilization
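The authors' analytic model is not reproduced in the slides. As a back-of-the-envelope illustration only: suppose a single server services requests serially at time t each. Then a flat scheme serializes all N nodes, while a two-level hierarchy with group size g splits the work across two stages:

```latex
% Toy serialization model (an assumption, not the authors' analysis):
% one flat server versus a two-level hierarchy with group size g.
\begin{align*}
  T_{\text{flat}}      &\approx N\,t \\
  T_{\text{two-level}} &\approx \frac{N}{g}\,t + g\,t
     \qquad \text{minimized at } g = \sqrt{N}, \text{ giving } 2\sqrt{N}\,t \\
  N = 4096:\quad & T_{\text{flat}} = 4096\,t
     \quad\text{vs.}\quad T_{\text{two-level}} \ge 128\,t
\end{align*}
```

This factor-of-32 gap, which grows with deeper layering, is consistent in spirit with the hours-versus-seconds difference reported above, though the real analysis must also account for network contention and configuration sizes.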


CARMA Monitoring Services

- Monitoring service
  - Statistics Collector
    - Gathers local and remote information
    - Updates GEMS* and local values
  - Query Processor
    - Processes task-scheduling requests from the JM
    - Maintains local information
  - Round-Robin Database (RRD)
    - Compact way to store performance logs (see the sketch after this slide)
    - Supports a simple query interface
- CARMA Diagnostic
  - System watchdog alerts based on defined heuristics of failure conditions
  - Provides system monitoring and debug
- Initial monitor version is complete
  - Studying FPGA monitoring options
  - Increasing the scheduling options
  - Tradeoff studies and analyses underway

Initial CARMA Monitor parameters:
A) Stats from the JM, ExMan, ConMan, BIM, and board: dynamic statistics (push or pull) and static statistics (pull)
B) Stats from remote nodes via GEMS
C) The Statistics Collector passes information to the RRD from local and remote modules via the Query Processor
D) The JM queries the RRD for resource information to make scheduling decisions
E) The CARMA Diagnostic performs system administration, debug, and optimization

[Figure: monitor data flow within a CARMA node. The JM, ConMan, ExMan, and the BIMs for each FPGA board feed statistics to the Statistics Collector (A); the collector exchanges data with GEMS on other nodes (B) and stores values in the RRD via the Query Processor (C); the JM queries the RRD for scheduling (D); the CARMA Diagnostic attaches to the modules for administration and debug (E).]

* Gossip-Enabled Monitoring Service (GEMS); developed by the HCS Lab for robust, scalable, multilevel monitoring of resource health and performance. For more info see http://www.hcs.ufl.edu/gems
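The RRD idea above is a fixed-size log that overwrites its oldest samples. A minimal, self-contained C sketch of that idea follows; the names are illustrative, and the actual CARMA RRD and its query interface are not shown in the slides.

```c
/* Minimal round-robin (ring) buffer for performance samples, sketching the
 * RRD concept: constant memory, oldest entries overwritten. All names are
 * illustrative; this is not the actual CARMA RRD. */
#include <stdio.h>
#include <time.h>

#define RRD_SLOTS 64                 /* fixed log size: compact storage */

typedef struct { time_t when; double value; } rrd_sample;

typedef struct {
    rrd_sample slot[RRD_SLOTS];
    size_t     next;                 /* index of slot to overwrite next */
    size_t     count;                /* samples stored, capped at SLOTS */
} rrd;

static void rrd_push(rrd *db, double value)
{
    db->slot[db->next].when  = time(NULL);
    db->slot[db->next].value = value;
    db->next = (db->next + 1) % RRD_SLOTS;   /* wrap around: round robin */
    if (db->count < RRD_SLOTS) db->count++;
}

/* Simple query: mean over the retained window. */
static double rrd_mean(const rrd *db)
{
    double sum = 0.0;
    for (size_t i = 0; i < db->count; i++) sum += db->slot[i].value;
    return db->count ? sum / db->count : 0.0;
}
```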


CARMA End-to-End Service Description

Functionality demonstrated to date:
- Graphical user interface
- Job/task scheduling based on board requirements and configuration temporal locality
- Parallel and serial jobs
- CARMA-registered and non-registered tasks
- Remote execution and result retrieval
- Configuration caching and management
- Mixed RC and "CPU-only" tasks
- Heterogeneous board execution (3 types thus far)
- System and RC device monitoring
- Inter-node communication via SCI or TCP/IP/GigE
- Fault-tolerant design: processes can be restarted while running

Virtually no system impact from CARMA overhead, despite the use of unoptimized code:
- Less than 5 MB RAM per node
- Less than 0.1% processor utilization on a 2.4 GHz Xeon server
- Less than 200 Kbps network utilization

CARMA Execution Stages

1) The user submits a job
2) The JM performs a task-schedule request, and the monitor replies with an execution location
3) The JM forwards tasks to the local or remote ExMan
4) If a task requires an RC board, the ExMan sends a configuration request to the local CM
5) The CM finds the file and configures the board
6) The user's task is forked (runs on the processor)
7) The user's task accesses RC boards via the BIM
8) Task results are forwarded to the originating JM
9) Job results are forwarded to the originating user
Note: all modules update the monitor. (A C sketch of stages 2 through 8 follows.)
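As a sketch only, reusing the hypothetical carma_* naming from earlier, stages 2 through 8 might look like this inside the JM and ExMan; none of these types or functions are from the actual CARMA source.

```c
/* Hypothetical control flow for execution stages 2-8; all types and
 * functions are invented stand-ins for the CARMA internals. */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include "carma_internal.h"   /* hypothetical: carma_task, carma_node, ... */

void jm_dispatch(carma_task *t)
{
    /* 2) ask the monitor where to run, based on board requirements and
     *    configuration temporal locality */
    carma_node loc = monitor_schedule_request(t);

    /* 3) forward the task to the chosen node's Execution Manager */
    exman_submit(loc, t);
}

void exman_execute(carma_task *t)
{
    /* 4-5) if an RC board is needed, the local CM locates the bit file
     *      and configures the board through the BIM */
    if (t->needs_rc_board)
        cm_request_config(t->bitfile);

    pid_t pid = fork();       /* 6) the user's task runs on the processor */
    if (pid == 0) {
        task_exec(t);         /* 7) the task reaches the board via the BIM */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    jm_return_results(t);     /* 8) results go back to the originating JM */
}
```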

[Figure: the execution stages overlaid on a CARMA node. The UI submits the job (1); the JM and monitor agree on an execution location (2); the ExMan receives the task (3); the CM is asked for a configuration (4) and programs the RC fabric (5); the task is forked onto the processor (6) and reaches the fabric through the BIM (7); results flow back through the JM (8) to the user (9).]


CARMA Framework Verification

Several test jobs executed concurrently:
- Parallel Add Test, composed of:
  - ADD.exe, a "CPU-only" task to add two numbers
  - AddOne.bit, an RC task to increment an input value
- Parallel N-Queens Test, composed of:
  - ADD.exe, a "CPU-only" task to add two numbers
  - NQueens.bit, an RC1000 task to calculate a subset of the total number of solutions for an N×N board
  - 4 RC1000s and 4 Tararis communicating via MPI (a sketch of this partitioning follows)
- Parallel Sieve of Eratosthenes (on Tarari)
- Parallel Monte Carlo Pi Generator (on Tarari)
- Blowfish encrypt/decrypt (on ADM-XRC)
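The authors' test code is not shown; as an illustration of how such a test might partition the work, here is a self-contained MPI C sketch where each rank counts solutions for a subset of first-row queen placements. In CARMA the per-subset count ran on an RC1000 via NQueens.bit; a plain C routine stands in for that RC task here.

```c
/* Hedged sketch (not the authors' code): parallel N-Queens where each MPI
 * rank handles a subset of first-row placements, then a reduce sums the
 * counts. solve() stands in for the NQueens.bit RC task. */
#include <mpi.h>
#include <stdio.h>

#define N 8

/* Count solutions given queens already placed in rows 0..row-1,
 * using bitmasks for attacked columns and diagonals. */
static long solve(int row, unsigned cols, unsigned diag1, unsigned diag2)
{
    if (row == N) return 1;
    long count = 0;
    for (int c = 0; c < N; c++) {
        unsigned bit = 1u << c;
        if ((cols & bit) || (diag1 & (1u << (row + c))) ||
            (diag2 & (1u << (row - c + N - 1))))
            continue;
        count += solve(row + 1, cols | bit,
                       diag1 | (1u << (row + c)),
                       diag2 | (1u << (row - c + N - 1)));
    }
    return count;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long local = 0, total = 0;
    for (int c = rank; c < N; c += size)        /* this rank's subset */
        local += solve(1, 1u << c, 1u << c, 1u << (N - 1 - c));

    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d-queens solutions: %ld\n", N, total);  /* 92 for N = 8 */
    MPI_Finalize();
    return 0;
}
```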

[Figure: data flow for the Parallel Add Test and the N-Queens Test, showing ADD.exe and AddOne.bit exchanging small integer operands and results, and NQueens.bit returning 92 solutions.]

These simple applications were used to test CARMA's functionality, while CARMA's services have wider applicability to problems of greater size and complexity.

[Figure: example system setup. Four Xeon servers: three full CARMA nodes (JM, ExMan, CM, BIM, monitor) hosting a Tarari board, an RC1000, and an ADM-XRC respectively, plus one server running the SWILL user interface, a CM with the configuration store, and GEMS. SCI carries configuration files; TCP/IP carries requests, tasks, and results.]


Conclusions

First working version of CARMA complete and tested, with numerous features supported:
- Simple GUI front-end interface
- Coherent multitasking, multi-user environment
- Dynamic RC fabric discovery and management
- Robust job scheduling and management
- Fault-tolerant and scalable services by design
- Performance monitoring down into the RC fabric
- Heterogeneous board support with hardware independence
- Linking to a COTS job management service

Initial testing shows the framework to be sound, with very little overhead imposed upon the system.


Future Work and Acknowledgements

Continue to fill in additional CARMA features:
- Include support for other boards, application mappers, and languages
- Complete the JM rollback feature and finish linkage to LSF
- Include broker and caching mechanisms for the peer-to-peer distributed CM scheme
- Include more intelligent scheduling algorithms (e.g., Last Release Time)
- Expand RC device monitoring and include debug and optimization mechanisms
- Enhance security, including secure data transfer and authentication
- Deploy on a large-scale test facility

Develop CARMA instantiations for other RC domains:
- Distributed shared-memory machines with RC (e.g., SGI Altix)
- Embedded RC systems (e.g., satellite/aircraft systems, munitions)

We wish to thank the following for supporting this research: the Department of Defense, Xilinx, Celoxica, Alpha Data, Tarari, and key vendors of our HPC cluster resources (Intel, AMD, Cisco, Nortel).