Embedded Computer Architecture Laboratory: A Hands-on Experience Programming Embedded Systems with Resource and Energy Constraints
Ashkan Beyranvand Nejad, Andrew Nelson, Anca Molnos, Davit Mirzoyan, Sorin Cotofana (Delft University of Technology, NL)
Kees Goossens (Eindhoven University of Technology, NL)
Challenge the future | Computer Engineering Group
Outline
• Introduction
• ECA-Lab Context & Setup
• Target Embedded MPSoC
• Target Application
• Assignment overview
• Conclusions
• Future Labs
Introduction
• Embedded Systems (ES)
  • widely used in various domains
  • moving towards Multi-Processor System-on-Chip (MPSoC) architectures
  • typically resource constrained, e.g.,
    • limited memory
      • deep SW stack not available
    • energy bound
  • resources are shared to reduce cost
  • to exploit this, the applications should be parallelized
Introduction
• ES Education
  • aims to equip students with the skills required to undertake embedded system projects in today's international companies
• One of the challenging projects is to parallelize applications and port them to embedded MPSoC platforms
• Parallelization makes it harder to understand the available trade-offs, for example between performance and energy
• In academia, we should enable students to gain hands-on experience of these challenges and allow them to experimentally explore the trade-offs in the design space for programming embedded MPSoCs
ECA-Lab: Goal & Context
• We set up a lab assignment that targets embedded MPSoCs
  • The lab is given to first-year master students concurrently with theoretical lectures on embedded computer architecture
  • So we call it ECA-Lab
• Five exercises are prepared to walk the students through some challenges of mapping, parallelizing and optimizing an application on an MPSoC platform
  • initially, the sequential C code of a fractal application is provided
• Students are assigned to groups of four people with as much intra-group diversity in technical skills and nationality as possible
ECA-Lab: Setup
• The application runs on a real platform implemented on an FPGA board
• The board is accessible remotely on-line by all the groups
  • there is a round-robin arbitration scheme
  • locking the board by one group is prevented
  • access time is bounded (~30 s)
• Keeping board access time low
  • the HW bit-stream resides on the server
  • only the ELF files per core are transferred
  • output information is parsed locally
• Low cost of setup
  • Xilinx ML605 FPGA board ~ $1800
  • Simple Linux server ~ $0
  • Internet connection ~ $0
Multiprocessor Platform: CompSoC
• Research MPSoC platform developed at TU Delft and TU/e
  • Tile-based, Network-on-Chip (NoC) centric
  • Distributed memory architecture
• CompSoC aims to reduce system complexity through two techniques
  • composability: each application can be developed, verified, and executed in isolation
  • predictability: real-time applications can be implemented in CompSoC because the timing properties of the (shared) resources are well defined
• Platform consists of
  • the AEthereal / aelite / daelite NoC
  • multiple MicroBlaze tiles with local memories and DMAs
  • the MicroBlazes optionally run the CompOSe real-time operating system (RTOS)
  • the Predator real-time DRAM memory scheduler and controller
Multiprocessor Platform: ECA-CompSoC
• An instance of the CompSoC platform with heterogeneous processor tiles
[Figure: ECA-CompSoC block diagram. Tile 0, Tile 1 and Tile 2 each contain a MicroBlaze with Imem, Dmem, sys timer, lcl timer, VFCM, SHsr and DMAmem. Tiles 0 and 1 also have DMAsr, RDMA0 and RDMA1; Tile 2 has RDMA0 only. The tiles connect through the NoC to the Frame Memory, the Global shared memory and a Monitor Tile.]
ECA-CompSoC: Processor Tile
[Figure: processor tile block diagram with MicroBlaze, Imem, Dmem, sys timer, lcl timer, VFCM, SHsr, DMAmem, DMAsr, RDMA0 and RDMA1.]
• Each processor tile comprises
  • a MicroBlaze processing core
  • a Voltage-Frequency Control Module (VFCM)
  • a local instruction memory (Imem)
  • a local data memory (Dmem)
  • a set of Remote Direct Memory Access (RDMA) modules with an associated set of local memory blocks for inter-tile communication
• Independent clock domain per tile
  • allowing independent voltage and frequency scaling by the VFCM
• Dmem and Imem are small
  • this makes fitting program code into them challenging

Memory block   Size (Byte)
Imem           8 K
Dmem           8 K
DMAmem         4 K
DMAsr          32
SHsr           32
ECA-CompSoC: Memory Tile
• Two memory tiles accessible via the NoC
  • Frame Memory
    • frame buffer, where visual data can be written for display
    • an API call from the application signals that the frame is ready, initiating its retrieval to the student's computer
  • Global shared memory
    • relatively large
    • writing and reading data to and from this memory is slow
• A variety of memories of different capacities
  • not all accessible from every tile
  • not all with the same access bandwidth

Memory Tile          Size (Byte)
Global shared mem.   32 K
Frame mem.           256 K
ECA-CompSoC: NoC Interconnect
[Figure: Tile 0, Tile 1 and Tile 2 connected through the NoC to the Frame Memory and the Global shared memory.]

Tile     Connection                   Bandwidth
Tile 0   RDMA0 -> Glbl. Shrd. Mem.    Slow
Tile 0   RDMA1 -> SHsr                Medium
Tile 1   RDMA0 -> Glbl. Shrd. Mem.    Slow
Tile 1   RDMA1 -> SHsr                Medium
Tile 2   RDMA0 -> Glbl. Shrd. Mem.    Fast
Tile 2   RDMA0 -> frame mem.          Fast
Tile 2   RDMA0 -> SHsr                Medium
ECA-CompSoC: Application Programming Interface (API)

Remote memory access API
    void hw_declare_dmas (int num_dmas);
    DMA* hw_dma_add (int id, void * base_addr);
    void hw_dma_receive (void * dst, void * src, int block_size, DMA * dma);
    void hw_dma_send (void * dst, void * src, int block_size, DMA * dma);
    int hw_dma_busy (DMA * dma);

Timer API
    unsigned int get_system_time();
    unsigned int get_local_time();

VFCM API
    void hw_vfcm_clk_gate (unsigned int t);
    void hw_vfcm_set_freq (unsigned int freq_level);

Debug API
    void print_debug (int value);

Frame Output API
    void print_framebuffer();
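The slide gives only the signatures, so the following is a minimal sketch of how a send followed by a completion poll might look, not the platform's actual behaviour. The `DMA` struct layout, the host-side stub bodies, the helper `send_block`, the reading of `block_size` as a byte count, and the assumption that `hw_dma_busy` returns non-zero while a transfer is in flight are all hypothetical; the stubs exist only so the sketch compiles on a desktop.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the ECA-CompSoC API. Only the function
   signatures are taken from the slide; the bodies are desktop stubs. */
typedef struct { int id; void *base_addr; int busy; } DMA;

#define MAX_DMAS 4
static DMA dma_table[MAX_DMAS];

void hw_declare_dmas(int num_dmas) {
    assert(num_dmas >= 0 && num_dmas <= MAX_DMAS);
}

DMA *hw_dma_add(int id, void *base_addr) {
    assert(id >= 0 && id < MAX_DMAS);
    dma_table[id].id = id;
    dma_table[id].base_addr = base_addr;
    dma_table[id].busy = 0;
    return &dma_table[id];
}

void hw_dma_send(void *dst, void *src, int block_size, DMA *dma) {
    /* Stub: the copy completes immediately, so the DMA is never left busy.
       block_size is treated as a byte count (an assumption). */
    memcpy(dst, src, (size_t)block_size);
    dma->busy = 0;
}

int hw_dma_busy(DMA *dma) {
    return dma->busy;
}

/* Hypothetical helper: start a transfer, then spin until the DMA is idle. */
void send_block(void *remote_dst, void *local_src, int bytes, DMA *dma) {
    assert(bytes > 0);
    hw_dma_send(remote_dst, local_src, bytes, dma);
    while (hw_dma_busy(dma)) {
        /* busy-wait; on the real platform useful work could go here */
    }
}
```

A caller would declare and add the DMA once at start-up, then call `send_block` for each chunk of results. Whether the real `hw_dma_send` returns before the transfer completes is not stated on the slide.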
ECA-CompSoC: Output Information
• Frame buffer is dumped and written in bitmap (bmp) format
• The timing, energy, and debug information per tile is provided
Application
• Features of target application
  • simple to understand algorithm
  • easily parallelizable
  • visual output
• Fractal application provided with
  • sequential C code
  • shape of the output is controlled by real and imaginary parameters
  • five sets of parameters are given

    double x_min = -1.5;
    double x_max = 1.5;
    double x_step = (x_max-x_min)/x_size;
    double y_min = -1.5;
    double y_max = 1.5;
    double y_step = (y_max-y_min)/y_size;
    double x,y;
    double new_x,new_y;
    int m,n,num;
    unsigned char R,G,B;
    double real = 0.123;
    double imaginary = 0.745;
    for(n=0; n<y_size; n++){
        for(m=0; m<x_size; m++){
            x = x_min + x_step * m;
            y = y_min + y_step * n;
            for(num=0; ((pow(x,2) + pow(y,2)) <= 4) && (num < 0xFF); num++){
                new_x = pow(x,2) - pow(y,2) + real;
                new_y = 2*x*y + imaginary;
                x = new_x;
                y = new_y;
            }
            FrameMemory.R = num;
            FrameMemory.G = num;
            FrameMemory.B = num;
        }
    }
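The slide code depends on `x_size`, `y_size` and `FrameMemory`, which are defined elsewhere. As a rough, self-contained sketch of the same recurrence, the version below wraps the inner loop in a function and writes a grey value per pixel into a plain buffer. The names `fractal_iterations` and `render_fractal` and the row-major buffer are assumptions made for illustration, and `pow(v,2)` is written as `v*v`, which is a substitution made here rather than something the source does.

```c
#include <assert.h>
#include <stddef.h>

/* Escape-time iteration count for one pixel, using the recurrence from the
   slide: x' = x^2 - y^2 + real, y' = 2xy + imaginary, capped at 0xFF. */
unsigned char fractal_iterations(double x, double y,
                                 double real, double imaginary) {
    int num;
    for (num = 0; (x * x + y * y <= 4.0) && (num < 0xFF); num++) {
        double new_x = x * x - y * y + real;
        double new_y = 2 * x * y + imaginary;
        x = new_x;
        y = new_y;
    }
    return (unsigned char)num;
}

/* Fill a row-major buffer of x_size * y_size grey values over the
   region [-1.5, 1.5) x [-1.5, 1.5), as on the slide. */
void render_fractal(unsigned char *out, int x_size, int y_size,
                    double real, double imaginary) {
    const double x_min = -1.5, x_max = 1.5;
    const double y_min = -1.5, y_max = 1.5;
    assert(out != NULL && x_size > 0 && y_size > 0);
    double x_step = (x_max - x_min) / x_size;
    double y_step = (y_max - y_min) / y_size;
    for (int n = 0; n < y_size; n++) {
        for (int m = 0; m < x_size; m++) {
            out[n * x_size + m] = fractal_iterations(x_min + x_step * m,
                                                     y_min + y_step * n,
                                                     real, imaginary);
        }
    }
}
```

Calling `render_fractal(buf, 64, 64, 0.123, 0.745)` would use the parameter pair shown on the slide. Each pixel depends only on its own coordinates, which is why the application is easy to parallelize.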
Assignment Overview
Assignment Overview: Exercise 1
• Each group has to parallelize the fractal application on a desktop computer using the pthreads library
• Provide a Makefile
  • make run-fractal-pthreads THREAD=X
• Goal
  • to get the students acquainted with the fractal application and with parallel programming in a desktop environment
  • to analyze and explain the parallelization performance
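One possible shape for the pthreads version is sketched below. It is an illustration under stated assumptions, not the reference solution: rows are dealt out to threads in an interleaved fashion, and the names `worker_args`, `worker`, `render_parallel` and the `MAX_THREADS` limit are hypothetical. The per-pixel recurrence and the (0.123, 0.745) parameters are taken from the application slide, with `pow(v,2)` written as `v*v`.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16   /* arbitrary upper bound for this sketch */

typedef struct {
    unsigned char *out;   /* shared output buffer, row-major */
    int x_size, y_size;
    int first_row;        /* first row computed by this thread */
    int row_stride;       /* equals the number of threads */
} worker_args;

static unsigned char iterations(double x, double y) {
    const double real = 0.123, imaginary = 0.745;
    int num;
    for (num = 0; (x * x + y * y <= 4.0) && (num < 0xFF); num++) {
        double new_x = x * x - y * y + real;
        y = 2 * x * y + imaginary;   /* still uses the old x */
        x = new_x;
    }
    return (unsigned char)num;
}

/* Each thread writes only its own rows, so no locking is needed. */
static void *worker(void *p) {
    worker_args *a = p;
    double x_step = 3.0 / a->x_size;
    double y_step = 3.0 / a->y_size;
    for (int n = a->first_row; n < a->y_size; n += a->row_stride)
        for (int m = 0; m < a->x_size; m++)
            a->out[n * a->x_size + m] =
                iterations(-1.5 + x_step * m, -1.5 + y_step * n);
    return NULL;
}

/* Returns 0 on success, -1 if a thread could not be created. */
int render_parallel(unsigned char *out, int x_size, int y_size,
                    int num_threads) {
    pthread_t tid[MAX_THREADS];
    worker_args args[MAX_THREADS];
    int started = 0, status = 0;
    assert(out != NULL && x_size > 0 && y_size > 0);
    assert(num_threads >= 1 && num_threads <= MAX_THREADS);
    for (int t = 0; t < num_threads; t++) {
        args[t] = (worker_args){ out, x_size, y_size, t, num_threads };
        if (pthread_create(&tid[t], NULL, worker, &args[t]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return status;
}
```

Interleaving rows is one way to even out the load, since rows near the centre of the image tend to need more iterations than rows near the edge. Contiguous blocks of rows would also work and may be worth comparing when explaining the measured speed-up.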
Assignment Overview: Exercise 2
• Each group has to map and execute the sequential fractal application on one core of the platform
• Goal
  • to familiarize the students with the ECA-CompSoC platform and the challenges of fitting their code into the small Imem
  • to introduce them to performance and energy consumption estimation
Assignment Overview: Exercise 3
• Each group has to parallelize the application on at least two processor tiles
• They have to evaluate the performance and energy consumed by the solution
• Goal
  • to familiarize the students with the challenges of parallelizing an application on an embedded MPSoC, such as synchronization and memory consistency problems
Assignment Overview: Exercise 4
• Each group has to optimize the performance of the parallel application executing on the platform
• The focus should be on multi-core strategies, e.g., computation versus communication
• The quality of the solution is graded according to the provided performance list:
  1. Execution time > 35000000 cycles
  2. Execution time in [30000000, 35000000] cycles
  3. Execution time < 30000000 cycles
• Goal
  • to gain experience with performance optimization on an embedded platform and the trade-offs involved
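As a small illustration of how a measured execution time maps onto the three bands above, the sketch below takes two readings of a free-running unsigned counter and classifies the difference. It assumes that a timer call such as `get_system_time()` returns a cycle count, which the slides do not state; the helper names `elapsed_cycles` and `performance_band` are hypothetical.

```c
#include <assert.h>

/* Elapsed count between two readings of a free-running unsigned counter.
   Unsigned subtraction wraps modulo 2^N, so the result stays correct if the
   counter overflowed once between the two readings. */
unsigned int elapsed_cycles(unsigned int start, unsigned int end) {
    return end - start;
}

/* Map an execution time in cycles to the band numbers on the slide:
   1: > 35000000, 2: [30000000, 35000000], 3: < 30000000. */
int performance_band(unsigned int cycles) {
    /* the thresholds do not fit in a 16-bit unsigned int */
    assert(sizeof(unsigned int) >= 4);
    if (cycles > 35000000u) return 1;
    if (cycles >= 30000000u) return 2;
    return 3;
}
```

On the platform, the two readings would be taken immediately before and after the region of interest, for example around the compute loop and around the DMA transfers separately, to see which of computation and communication dominates.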
Assignment Overview: Exercise 5
• Each group has to minimize the energy consumed by their mapped application such that the execution finishes before a deadline
• Each group receives a different value for the deadline
• Energy minimization can be achieved by:
  1. Clock gating the cores for a period of time
  2. Scaling down the (voltage and) frequency of the core for a period of time
• The groups may choose to combine both of these techniques
• Goal
  • to introduce the students to performance-constrained energy optimization on an embedded platform
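To make the deadline-constrained trade-off concrete, here is a toy model, offered as a sketch only. It uses the textbook approximation that dynamic power scales with V^2 * f, so the dynamic energy of a fixed number of cycles scales with V^2 and is independent of frequency. Static power, the cost of switching levels and the platform's real energy accounting are ignored. The `op_point` type, both function names and any operating points passed in are hypothetical and are not the VFCM's actual levels.

```c
#include <assert.h>
#include <stddef.h>

/* One voltage-frequency operating point (hypothetical values only). */
typedef struct {
    double freq_mhz;
    double volt;
} op_point;

/* Dynamic energy in arbitrary units for a given cycle count:
   P ~ V^2 * f and t = cycles / f, hence E ~ V^2 * cycles. */
double dynamic_energy(op_point p, double cycles) {
    return p.volt * p.volt * cycles;
}

/* Index of the lowest-frequency point that still finishes `cycles` within
   `deadline_us`, or -1 if none does. `pts` must be sorted by ascending
   frequency. With freq in MHz, cycles / freq_mhz is a time in microseconds. */
int slowest_feasible(const op_point *pts, int n,
                     double cycles, double deadline_us) {
    assert(pts != NULL && n >= 0 && cycles >= 0.0 && deadline_us > 0.0);
    for (int i = 0; i < n; i++) {
        assert(pts[i].freq_mhz > 0.0);
        if (cycles / pts[i].freq_mhz <= deadline_us)
            return i;
    }
    return -1;
}
```

Under this model, running slower at a lower voltage and finishing just before the deadline uses less dynamic energy than running at full speed and then clock gating, because gating removes idle switching but does not lower the voltage of the active cycles. On the real platform the balance depends on the measured energy figures, which is what the exercise asks the groups to explore.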
Assignment Overview: Trade-off results
Conclusions
• Students who follow our laboratory gain hands-on experience programming a multi-core embedded system
• They will have overcome the difficulties of programming for a resource-constrained platform with limited debug visibility
• They will have investigated their solution's design space for both performance and energy consumption, learning the trade-offs that exist on such a platform
• The students work in multi-cultural groups, with diverse backgrounds and experience, similar to what is found in international companies and academia
• The lab was given successfully in the 2011-2012 academic year at Delft University of Technology
• The students' feedback was in general positive. However, in practice the students spent more time than scheduled working on the laboratory project
Future ECA-Labs: This year
• Target a new application that also receives input data
  • ECA-Lab 2012-2013: Instagram-like image filtering application
[Figure: the filter application runs on ECA-CompSoC.]
Thank You!
For further information please visit www.compsoc.eu