F1-07: Simulative Performance Prediction of RC Systems, for RC course 08. Presented by Gongyu Wang, PhD student, F1, CHREC.
December 5-6, 2007
2007 Annual Workshop
F1-07: Simulative Performance Prediction of RC Systems
for RC course 08
Presented by: Gongyu Wang
PhD student, F1, CHREC

Goals, Motivations
Goals
- Develop the first tool for simulative performance prediction of complex RC systems and apps
- Explore design tradeoffs of complex, multi-paradigm systems & applications via modeling and simulation
Motivations
- Provide an efficient, comprehensive method of evaluating and prototyping RC systems
- Facilitate fast system design tradeoffs
- Enable application mapping/decomposition analyses without hardware implementations
[Figure labels: FIDELITY, SPEED]

RC Simulation Framework
[Figure: RC Simulation Framework Diagram]
- 6 key components of framework depicted in figure
  - Many key tasks can be completed independently and in parallel
- Framework allows arbitrary applications to be simulated on arbitrary systems
  - Component models and application scripts can be reused for rapid simulative analyses
- System models driven by application scripts produce simulative performance prediction results
  - Systems modeled in 2007 included socket-connected FPGA platform (XD1000), PCI-based server cluster (Nallatech cluster), and custom supercomputer FPGA platform (SRC-6)

RC Simulation Framework
Application Scripts
- A simple, customizable script format provides the interface between domains
- Scripts characterize high-level behavior of application by defining key events
  - Key events include network transactions, processor computation blocks, RC core processing, data transfers with RC devices
- Simulation speed enhanced by abstracting away computation performed by the non-RC portion of the system
- RC events contain transfer size, core configurations, etc.
- MPI events contain data size, destination and source information, transfer type, etc.
Architecture Modeling
- Modeling and simulations performed in discrete-event environment called Mission-Level Designer (MLD)
  - Hierarchical, block-based modeling environment
- Customized models developed via C-style programming
[Figure: Sample Application Script]
[Figure: Sample View of MLD Models (Sample 4-Node System, Node-level Model)]
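As a rough illustration of how an application script can drive a discrete-event model, the Python sketch below walks each node's list of key events and advances simulated time by a per-event delay. The event names, dictionary fields, and parameter values are hypothetical; they are not the actual script format or the MLD models, and contention between nodes is not modeled.

```python
import heapq

def event_duration(event, params):
    """Return the simulated duration (seconds) of one scripted event.

    Event kinds and fields are illustrative assumptions only.
    """
    kind = event["kind"]
    if kind in ("compute", "rc_core"):
        # Computation is abstracted to a fixed time taken from the script.
        return event["seconds"]
    if kind == "rc_transfer":
        return params["rc_latency_s"] + event["bytes"] / params["rc_bandwidth_Bps"]
    if kind == "mpi_send":
        return params["net_latency_s"] + event["bytes"] / params["net_bandwidth_Bps"]
    raise ValueError(f"unknown event kind: {kind}")

def simulate(scripts, params):
    """Run every node's script; return {node: finish_time_in_seconds}."""
    # Each queue entry is (time, node, index of next event in that node's script).
    queue = [(0.0, node, 0) for node in sorted(scripts)]
    heapq.heapify(queue)
    finish = {}
    while queue:
        now, node, idx = heapq.heappop(queue)
        if idx == len(scripts[node]):
            finish[node] = now
            continue
        done = now + event_duration(scripts[node][idx], params)
        heapq.heappush(queue, (done, node, idx + 1))
    return finish

# Hypothetical single-node script and system parameters.
example_params = {
    "rc_latency_s": 0.0, "rc_bandwidth_Bps": 1e6,
    "net_latency_s": 0.0, "net_bandwidth_Bps": 1e6,
}
example_scripts = {
    "node0": [
        {"kind": "compute", "seconds": 1.0},
        {"kind": "rc_transfer", "bytes": 1_000_000},
        {"kind": "rc_core", "seconds": 0.5},
    ]
}
example_result = simulate(example_scripts, example_params)
```

Swapping `example_params` for a different system's parameters while keeping the same script mirrors the reuse of application scripts across system models described above.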

Supported Features
- Double buffering
  - A series of double-buffered FPGA core requests can be specified by a single script line
    - Use the optional area of the script line
  - CPU computation blocks, performed per data chunk and in parallel with FPGA processing, defined via pre-chunk and post-chunk lines
- Power modeling
  - Exact determination of FPGA power consumption is a complex task
    - Dependent on results of place-and-route, values of input data that drive signal changes throughout fabric, etc.
  - Quick FPGA power consumption estimates can be obtained via worksheet method
    - Such power worksheets provided by both Altera and Xilinx
    - Power estimates rendered from device technology, resource usage, signal switching rate, clock frequency
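A worksheet-style estimate can be sketched as a static term plus a dynamic term that scales with resource count, switching rate, and clock frequency. The function below is only a minimal sketch under that assumption; the resource names and coefficients are made up for illustration and do not come from any vendor worksheet.

```python
def estimate_power_watts(static_w, resources, clock_mhz):
    """Worksheet-style FPGA power estimate (illustrative model only).

    resources maps a resource name to a tuple:
      (count used, microwatts per MHz per resource at 100% toggle, toggle rate 0..1)
    All coefficients are hypothetical placeholders for device-technology data.
    """
    dynamic_uw = 0.0
    for count, uw_per_mhz, toggle_rate in resources.values():
        dynamic_uw += count * uw_per_mhz * toggle_rate * clock_mhz
    return static_w + dynamic_uw * 1e-6

# Hypothetical design: 10,000 logic elements toggling 25% of the time at 100 MHz.
example_power = estimate_power_watts(
    static_w=0.5,
    resources={"logic": (10_000, 0.5, 0.25)},
    clock_mhz=100.0,
)
```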
//Script for double-buffered FPGA execution
rc_core_request_db 1 FFT 8192x1024 2 0
comp_prechunk 5.0
comp_postchunk 10.0
[Figure: Example of Single/Double Buffering]
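The benefit of double buffering can be estimated with a simple pipeline model. The sketch below assumes every chunk has the same input-transfer, compute, and output-transfer time, that input and output transfers use independent channels, and that buffering is sufficient to keep the slowest stage busy. These are simplifying assumptions, not the behavior of the actual models, and the pre-chunk and post-chunk CPU blocks are not included.

```python
def single_buffered_time(n_chunks, t_in, t_comp, t_out):
    """No overlap: each chunk is sent, processed, and returned in sequence."""
    return n_chunks * (t_in + t_comp + t_out)

def double_buffered_time(n_chunks, t_in, t_comp, t_out):
    """Pipeline estimate: one full pass for the first chunk, then one
    bottleneck-stage interval for every additional chunk."""
    if n_chunks == 0:
        return 0.0
    bottleneck = max(t_in, t_comp, t_out)
    return (t_in + t_comp + t_out) + (n_chunks - 1) * bottleneck

# Hypothetical timings in seconds for 4 chunks.
t_single = single_buffered_time(4, 1.0, 2.0, 1.0)   # 16.0
t_double = double_buffered_time(4, 1.0, 2.0, 1.0)   # 10.0
```

With a single chunk the two estimates coincide, since there is nothing to overlap.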

SRC-6 System Modeling -- Methodologies
- Architecture components modeled
  - Microprocessor node
  - MAP node (user FPGA inside)
  - Hi-Bar switch
- Interfaces included in MAP and SNAP microprocessor models
  - FIFO queue for each output port
  - Delay calculation based on sustainable payload BW and documented latency
- Simulation assumptions
  - Map_allocate and Map_free take constant time to execute
    - Assumption appears accurate based on experimental observations
  - FPGA configuration time is measured from benchmark and serves as a constant value in the model
    - Configured, tunable in parameter file
  - Models for now only account for simple and common MAP functions
    - Constrained by the black box model of FPGAs in our framework
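The output-port behavior described above (a FIFO queue per port, with delay derived from sustainable payload bandwidth and documented latency) can be sketched as follows. The class name, method, and parameter values are hypothetical and are not taken from the MLD models.

```python
class OutputPort:
    """FIFO output port: messages are serialized in arrival order.

    Delivery time = queueing wait + size / bandwidth + fixed latency.
    """

    def __init__(self, bandwidth_bytes_per_s, latency_s):
        self.bandwidth = bandwidth_bytes_per_s
        self.latency = latency_s
        self.busy_until = 0.0  # time at which the port finishes its current backlog

    def send(self, arrival_time, size_bytes):
        """Return the time at which the message is delivered."""
        # FIFO: a message cannot start until earlier messages have left the port.
        start = max(arrival_time, self.busy_until)
        self.busy_until = start + size_bytes / self.bandwidth
        return self.busy_until + self.latency

# Hypothetical port: 100 bytes/s sustained payload bandwidth, 0.5 s latency.
port = OutputPort(bandwidth_bytes_per_s=100.0, latency_s=0.5)
first_delivery = port.send(arrival_time=0.0, size_bytes=100)   # 1.5
second_delivery = port.send(arrival_time=0.0, size_bytes=100)  # 2.5, queued behind the first
```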
SRC-6 System Modeling – MLD Models
*Models not yet complete

List of Simulative Results Compiled
Validation Results
- Single-node SAR (Delta, XD1000, SRC-6), two data sets
- Single-node MD (XD1000)
- Single-node TD (Delta, SRC-6)
- Parallel TD w/ two (2) and four (4) nodes (Delta)
- Single-node HSI (Delta, SRC-6)
  - Included use of two FPGA cores, ACSM and TD
Simulative Case Studies
- SAR performance vs. I/O parameters (Delta)
- SAR performance vs. FPGA size (Delta, SRC-6)
- SAR performance vs. enhanced core design (Delta, SRC-6, XD1000)
- ACSM speedup vs. # of SRAM banks (XD1000)
- ACSM speedup vs. # of spectral bands (Delta)
- ACSM speedup vs. system size (XD1000, Delta)
- HSI vs. system size and data network (Delta, XD1000)
- MD speedup vs. data set size (XD1000)
- MD speedup vs. system size (XD1000)
- MD speedup vs. core design/parallelization strategy (XD1000)

SAR Simulative Studies
SAR Validation Summary

System         Image  SW Exp. Runtime  RC Sim. Runtime  Predicted Speedup  RC Exp. Runtime  % Diff.
Delta, 1 Node  A      267.55s          291.21s          0.92               297.15s          2.00%
SRC-6          A      394.25s          238.45s          1.65               232.1s           2.74%
Delta, 1 Node  B      59.03s           64.71s           0.91               63.82s           1.70%
SRC-6          B      122.85s          78.57s           1.56               79.04s           0.60%

SAR notes
- Image A = 5616x27990 pixels
- Image B = 5616x8192 pixels
- SRC-6 contains relatively small FPGA
  - Only one single-buffered FFT fits on device
  - Following chart predicts performance when larger FPGA is available
[Figure: SAR Runtime vs. FPGA Size. x-axis: Number of FFT cores (1, 2, 4, 8); y-axis: Execution Time (sec), ticks from 180 to 250]
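The derived columns of the SAR validation table can be recomputed from the runtimes. The sketch below assumes predicted speedup = SW runtime / RC simulated runtime and % diff. = |simulated - experimental| / experimental; these definitions are inferred from the numbers, not stated in the slides. Under them, the Delta and SRC-6 rows for image A are reproduced exactly, while the 1.70% listed for Delta with image B is not (the listed runtimes give about 1.39%), so the original definition may differ.

```python
def predicted_speedup(sw_runtime_s, rc_sim_runtime_s):
    """Speedup of the simulated RC run over the software-only run."""
    return sw_runtime_s / rc_sim_runtime_s

def percent_diff(rc_sim_runtime_s, rc_exp_runtime_s):
    """Prediction error relative to the measured RC runtime, in percent."""
    return abs(rc_sim_runtime_s - rc_exp_runtime_s) / rc_exp_runtime_s * 100.0

# (system, image, SW exp. runtime, RC sim. runtime, RC exp. runtime), in seconds,
# copied from the validation table.
validation_rows = [
    ("Delta, 1 Node", "A", 267.55, 291.21, 297.15),
    ("SRC-6", "A", 394.25, 238.45, 232.1),
    ("Delta, 1 Node", "B", 59.03, 64.71, 63.82),
    ("SRC-6", "B", 122.85, 78.57, 79.04),
]

for system, image, sw, sim, exp in validation_rows:
    print(f"{system:14s} {image}  speedup={predicted_speedup(sw, sim):.2f}  "
          f"diff={percent_diff(sim, exp):.2f}%")
```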

SAR Simulative Studies
- SAR processing has two stages, an FFT and an IFFT, separated by a single vector multiply (VM)
  - Currently, VM is performed by host processor
- Enhanced core combining FFT, VM, and IFFT simulated on all three systems for 2 image sizes
- Table below summarizes prediction results using enhanced SAR core
  - On Delta, FPGA now produces speedup instead of slowdown, since I/O bottleneck is minimized
  - On XD1000 and SRC-6, very little additional speedup is predicted, since FPGA transfers were not a bottleneck in the baseline

SAR Enhanced Core Summary

System      Image  SW Exp. Runtime (s)  RC Sim. Runtime (s)  Predicted Speedup
Delta Node  A      267.55               239.73               1.12
XD1000      A      185.08               147.06               1.26
SRC-6       A      394.25               207.68               1.90
Delta Node  B      59.03                53.27                1.11
XD1000      B      40.53                32.68                1.24
SRC-6       B      122.85               61.88                1.99
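One way to see why a combined FFT/VM/IFFT core helps most on an I/O-limited platform is a simple transfer-count model. The sketch assumes the baseline moves the data set to the FPGA and back twice (once per FFT stage, with the VM on the host), while the enhanced core needs one round trip. All timings and bandwidths below are hypothetical and are not measurements from Delta, XD1000, or SRC-6.

```python
def baseline_time(n_bytes, bandwidth_Bps, t_fft, t_vm_host, t_ifft):
    """FFT on FPGA, VM on host, IFFT on FPGA: four data transfers in total."""
    return 4 * n_bytes / bandwidth_Bps + t_fft + t_vm_host + t_ifft

def enhanced_time(n_bytes, bandwidth_Bps, t_fft, t_vm_fpga, t_ifft):
    """FFT, VM, and IFFT fused in one FPGA core: two data transfers in total."""
    return 2 * n_bytes / bandwidth_Bps + t_fft + t_vm_fpga + t_ifft

def improvement(n_bytes, bandwidth_Bps):
    """Ratio of baseline to enhanced time for fixed, hypothetical compute times."""
    base = baseline_time(n_bytes, bandwidth_Bps, t_fft=1.0, t_vm_host=0.5, t_ifft=1.0)
    fused = enhanced_time(n_bytes, bandwidth_Bps, t_fft=1.0, t_vm_fpga=0.1, t_ifft=1.0)
    return base / fused

# Hypothetical 100 MB data set on a slow link versus a link ten times faster.
gain_slow_link = improvement(1e8, 1e8)
gain_fast_link = improvement(1e8, 1e9)
```

In this toy model the gain is larger on the slow link, which is consistent with the qualitative claim that the enhanced core matters most where I/O is the bottleneck.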

F1-08: RCML
Quick overview for RC course 08

Goals, Motivations, and Challenges
Goals
- Research concepts for an RC abstraction layer featured in app formulation stage
  - Allow specification of design/architecture via standardized high-level descriptions
  - Create mapping of abstract descriptions into script format that can be used by system models to drive simulative perf. predictions
  - Demonstrate methods using proof-of-concept case studies
- Explore methods for enhanced modeling of FPGA core designs
Motivations
- Formulation is often neglected/bypassed during development of RC applications
  - Promote use of formulation with new abstraction layer and language
- Provide user-friendly interface to simulation framework
[Figure: Conceptual Flow of RC Formulation Stage. Labels: Formulation Stage; Design Stage; Performance Prediction via F1 Simulation Framework; Abstract RC Language Representation; Code Template(s); Algorithm/Architecture Exploration]

Introduction
- Build RCML on top of AADL
  - AADL is an SAE standard, recommended by multiple CHREC sponsors
  - Lacks algorithm exploration constructs, thus RCML will need to add this functionality
  - RCML should be designed without consideration of AADL mapping
- Separation of algorithm model from architecture model
  - RCML composed of concepts & structures for RC algorithms and apps
    - Even though we’re building this on top of AADL specs and tools, RCML should be considered separate from AADL
  - Algorithm specification will stand alone, independent of platform details (to a certain degree)
    - Stored as pure SW AADL spec
  - Platform architectures specified independently, based on AADL hardware classes and models
    - Library of common, tunable components to be included
  - A mapping procedure and file defined to map RCML algorithm model to architecture model
    - Mapping files connect otherwise separate alg. and arch. files
    - A tool will parse software, hardware and mapping files into comprehensive AADL HW/SW system specification
[Figure: Classes of AADL Components]
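The separation of algorithm, architecture, and mapping files can be sketched with three independent structures and a small binding step. The dictionaries, field names, and the `bind` function below are hypothetical stand-ins for illustration; they are not RCML or AADL syntax, and the actual tool is not shown in the slides.

```python
# AADL execution-platform categories used here as hardware classes.
HARDWARE_CLASSES = {"processor", "memory", "bus", "device"}

# Algorithm model: function blocks only, no platform details.
algorithm = {
    "fft_ifft": {"kind": "function_block"},
    "vector_multiply": {"kind": "function_block"},
}

# Architecture model: platform components, specified independently.
architecture = {
    "cpu0": {"class": "processor"},
    "fpga0": {"class": "device"},
}

# Mapping file: the only place where the two models meet.
mapping = {"fft_ifft": "fpga0", "vector_multiply": "cpu0"}

def bind(algorithm, architecture, mapping):
    """Combine the three inputs into one system description, with basic checks."""
    for name, spec in architecture.items():
        if spec["class"] not in HARDWARE_CLASSES:
            raise ValueError(f"component {name!r} has unknown class {spec['class']!r}")
    system = {}
    for block, props in algorithm.items():
        if block not in mapping:
            raise ValueError(f"block {block!r} has no mapping entry")
        target = mapping[block]
        if target not in architecture:
            raise ValueError(f"block {block!r} mapped to unknown component {target!r}")
        system[block] = {**props, "runs_on": target,
                         "target_class": architecture[target]["class"]}
    return system

system_spec = bind(algorithm, architecture, mapping)
```

Remapping `vector_multiply` to `fpga0` changes only the mapping, leaving the algorithm and architecture descriptions untouched.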

RCML Algorithm Constructs
- Not all RC applications easily represented within one modeling domain
  - Environments like Ptolemy and MLD support domains that include data flow, FSM, discrete-event, continuous time, etc.
- Need to support multiple models of computation in formulation
  - Otherwise, usefulness of formulation language is limited
- To address domain issue, multiple classes of function blocks and ports defined
  - Data ports: used to transmit data sets or streams between data- and/or control-driven blocks
  - Control ports: used to transmit control signals between blocks, trigger control-driven blocks
[Figure: LabView Supported Programming Domains]

RCML Algorithm Constructs
- Function blocks represent fundamental element of RCML designs
  - Function blocks represent individual portions of algorithm
  - Function blocks defined using pre-conditions, post-conditions, and properties
    - No code is defined within block, just defined properties
  - Ports on function blocks define how block interacts with remainder of algorithm
- Data-driven function block
  - A function block that only contains data inputs
  - Execution triggered by receiving data on input data ports
    - Combination of input events required for triggering defined in block’s pre-condition
  - For support of data flow and discrete-event models
- Control-driven function block
  - A function block that contains control input(s)
  - Execution triggered by changes to received control signals
  - Support FSM defined behavior
- Specialized Controller function block defined for creating application controllers
  - Allow FSMs to be built inside controller
  - Controller should only accept and output control signals which drive external control-driven function blocks
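The triggering rules for the three block types can be sketched in a few lines. The classes below are a hypothetical illustration of the concepts, not RCML itself; in particular, the default pre-condition (one token on every data input) and the FSM transition table format are assumptions.

```python
class DataDrivenBlock:
    """Fires when its pre-condition over the pending input data holds."""

    def __init__(self, name, input_ports, precondition=None):
        self.name = name
        self.pending = {port: [] for port in input_ports}
        # Assumed default pre-condition: every input port holds at least one token.
        self.precondition = precondition or (lambda pending: all(pending.values()))
        self.fired = 0

    def receive(self, port, token):
        """Queue a token; consume one token per non-empty port if the block fires."""
        self.pending[port].append(token)
        if not self.precondition(self.pending):
            return False
        for queue in self.pending.values():
            if queue:
                queue.pop(0)
        self.fired += 1
        return True


class ControlDrivenBlock:
    """Fires only when the received control signal changes."""

    def __init__(self, name):
        self.name = name
        self.last_signal = None
        self.fired = 0

    def control(self, signal):
        if signal == self.last_signal:
            return False
        self.last_signal = signal
        self.fired += 1
        return True


class Controller:
    """FSM that accepts control signals and emits control signals."""

    def __init__(self, start_state, transitions):
        self.state = start_state
        # transitions: {(state, control_in): (next_state, control_out)}
        self.transitions = transitions

    def step(self, control_in):
        self.state, control_out = self.transitions[(self.state, control_in)]
        return control_out


# Example: a controller output drives a control-driven block.
ctrl = Controller("idle", {("idle", "go"): ("busy", "start"),
                           ("busy", "done"): ("idle", "stop")})
worker = ControlDrivenBlock("fft")
worker_fired = worker.control(ctrl.step("go"))
```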

Conclusions
- Developed and demonstrated framework for timely performance prediction of RC systems and applications
  - Three classes of RC systems modeled and presented
    - RC cluster, FPGA co-processor platform (XD1000), and custom supercomputing platform (SRC-6)
  - Simulative experiments conducted on each platform with multiple applications
    - Synthetic Aperture Radar (SAR), Hyper-Spectral Imaging (HSI), and Molecular Dynamics (MD)
- Proposed RCML to address the formulation stage of RC application development process
  - Built upon AADL
  - Architectural modeling methodology inherited from F1-07
  - Algorithmic modeling methodology constructed