High-performance and Hardware-aware Computing
Proceedings of the First International Workshop on New Frontiers in
High-performance and Hardware-aware Computing (HipHaC’08)
Lake Como, Italy, November 2008 (In Conjunction with
MICRO-41)
Rainer Buchty Jan-Philipp Weiß (eds.)
Universitätsverlag Karlsruhe 2008 Print on Demand
ISBN: 978-3-86644-298-6
Organization
Workshop Organizers:
Rainer Buchty
Jan-Philipp Weiß
Steering Committee:
Jürgen Becker
Vincent Heuveline
Wolfgang Karl
Jan-Philipp Weiß
Program Committee:
Mladen Berekovic
Eric D’Hollander
Ghent University, Belgium
Richard Kaufmann
Hewlett-Packard, USA
Paul Kelly
Andy Nisbet
Stephan Wong
systems are complemented by coprocessors, accelerators, and reconfigurable units providing huge computational power. However, applications of scientific interest (e.g. in high-performance computing and numerical simulation) are not yet ready to exploit this available computing potential. Differing programming models, ill-matched interfaces, and bandwidth bottlenecks complicate holistic programming approaches for heterogeneous architectures. In modern microprocessors, hierarchical memory layouts and complex logic obscure the predictability of memory transfers and performance estimates, and workloads must be decomposed into parallel tasks that require additional communication and synchronization. Currently, a comprehensive knowledge of the underlying hardware is therefore mandatory for application programmers. Hence, there is a strong need for virtualization concepts that free programmers from hardware details while maintaining best performance and enabling deployment in heterogeneous and reconfigurable environments.
The First International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'08) – held in conjunction with the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41) – aims at combining new aspects of parallel, heterogeneous, and reconfigurable system architectures with concepts of high-performance computing and, particularly, numerical solution methods. It brings together international researchers of all affected fields to discuss issues of high-performance computing on emerging hardware architectures, ranging from architecture work to programming and tools.
The workshop organizers would therefore like to thank the MICRO-41
Workshop Chair for giving us the chance to
host this workshop in conjunction with one of the world’s finest
conferences on computer and system architecture –
and of course all the people who made this workshop finally happen,
most notably Wolfgang Karl (KIT) for initial
inspiration. Thanks to the many contributors submitting exciting
and novel work, HipHaC’08 will reflect a broad
range of issues on architecture design, algorithm implementation,
and application optimization.
Karlsruhe, October 2008
Rainer Buchty & Jan-Philipp Weiß
Karlsruhe Institute of Technology (KIT)
Table of Contents
Architectures
OROCHI: A Multiple Instruction Set SMT Processor . . . . . 1
  Takashi Nakada, Yasuhiko Nakashima, Hajime Shimada, Kenji Kise, and Toshiaki Kitamura

Stream Processing and Numerical Computation
Experiences with Numerical Codes on the Cell Broadband Engine Architecture . . . . . 9
  Markus Stürmer, Daniel Ritter, Harald Köstler, and Ulrich Rüde
A Realtime Ray Casting System for Voxel Streams on the Cell Broadband Engine . . . . . 17
  Valentin Fuetterling and Carsten Lojewski
Comparison of High-Speed Ray Casting on GPU using CUDA and OpenGL . . . . . 25
  Andreas Weinlich, Benjamin Keck, Holger Scherl, Markus Kowarschik, and Joachim Hornegger
RapidMind Stream Processing on the PlayStation 3 for a 3D Chorin-based Navier-Stokes Solver . . . . . 31
  Vincent Heuveline, Dimitar Lukarski, and Jan-Philipp Weiß

Temporal Locality
Optimising Component Composition using Indexed Dependence Metadata . . . . . 39
  Lee W. Howes, Anton Lokhmotov, Paul H. J. Kelly, and A. J. Field
Accelerating Stencil-Based Computations by Increased Temporal Locality on Modern Multi- and Many-Core Architectures . . . . . 47
  Matthias Christen, Olaf Schenk, Peter Messmer, Esra Neufeld, and Helmar Burkhart
Fast Cache Miss Estimation of Loop Nests using Independent Cluster Sampling . . . . . 55
  Kamal Sharma, Sanjeev Aggarwal, Mainak Chaudhuri, and Sumit Ganguly

List of Authors . . . . . 65
OROCHI: A Multiple Instruction Set SMT Processor
Takashi Nakada∗, Yasuhiko Nakashima∗, Hajime Shimada†, Kenji Kise‡, and Toshiaki Kitamura§
∗Graduate School of Information Science, Nara Institute of Science and Technology, JAPAN, {nakada, nakashim}@is.naist.jp
†Graduate School of Informatics, Kyoto University, JAPAN, [email protected]
‡Graduate School of Information Science and Engineering, Tokyo Institute of Technology, JAPAN, [email protected]
§Graduate School of Information Sciences, Hiroshima City University, JAPAN, [email protected]
Abstract—To develop embedded computer systems, one straightforward approach is to employ heterogeneous multi-processors or multi-cores that combine a single traditional core with several SIMD/VLIW cores. This approach is suitable not only for quick integration of a de-facto standard OS and new multimedia programs, but also for QoS. However, such a well-known architecture increases the chip area, the complexity of the bus structure, and the cost of the chip, and it uses the dedicated cache memory inefficiently. As an efficient embedded processor, we propose a heterogeneous SMT processor that has two different front-end pipelines: one for the ARM architecture, targeting irregular programs, and one for the FR-V (VLIW) architecture, targeting multimedia applications. FR-V instructions run through the simple decoder and are enqueued into the VLIW queue. The instructions in the VLIW queue are simultaneously shifted to the next stage once the instructions in the final portion have all been issued. ARM instructions, on the other hand, are decomposed into simple instructions suitable for the VLIW queue and are scheduled based on data dependencies and the available empty slots. After that, the mixed instructions in the VLIW queue are issued to the common back-end pipeline. In this paper, a simple instruction scheduler and a mechanism for QoS are presented. We evaluated the performance with an RTL-level simulator and also evaluated the chip area. The results show that the microarchitecture can increase the total IPC by 20.7% compared to a well-known QoS mechanism controlled by a process scheduler in the OS, and can reduce the total chip area by 34.5% compared to a well-known separated multi-core implementation.
Index Terms—Heterogeneous SMT processor, VLIW, Quality of Service
I. INTRODUCTION
In recent years, it has become popular to enjoy high-quality multimedia content on portable devices. The processors for such embedded devices are required to achieve high performance for multimedia applications while operating at ultra low power to enable the use of smaller batteries. Unfortunately, well-known superscalar processors are unacceptable for such embedded devices on two counts. First, power-hungry processors with large heatsinks are hard to fit into embedded devices, which usually have a small chassis. Second, the processors need to consume less power so as to extend battery life as much as possible. For this field, in place of traditional wide-issue superscalars, many heterogeneous multi-cores have been proposed to meet these requirements. Considering the heavy multimedia workload in modern embedded devices, VLIW processors are good candidates because sufficient ILP in multimedia programs is easily detected by the compiler, so complicated issue mechanisms can be omitted. By incorporating well-designed multimedia libraries, VLIW can achieve good performance at low power. However, VLIW is less competitive in applications with little ILP. Moreover, library support for general-purpose applications is comparatively poor. Consequently, some general-purpose processors are also included, resulting in a heterogeneous multi-core processor. Thus, heterogeneous multi-core processors have become popular, as is the case with the Cell Broadband Engine [1], which includes a POWER Processing Element (PPE) as a general processor and eight Synergistic Processing Elements (SPEs) as media processors.
However, from the point of view of semiconductor technology, multi-cores that increase the footprint by straightforwardly incorporating discrete cores are not the best solution, because static power leakage and process variation will be big obstacles for next-generation low-power, high-performance processors. Static power leakage is proportional to footprint, so it is crucial to reduce the footprint in the near future. In particular, the general processor in a multi-core is quite large, because its design tends to be imported from traditional implementations, despite the general processor's small contribution to performance. If we unify the general-purpose processor with media processors such as VLIW, the footprint is minimized, and the dedicated cache area of the general-purpose processor can be effectively utilized as additional cache for the media processors. Such integration shows promise for smaller footprints and high performance.
Meanwhile, conventional SMT execution models [2], which also share a single pipeline and the data cache, are generally not suitable for QoS control. However, in many embedded systems QoS control is one of the important requirements; the processor has to guarantee the frame rate of a video decoder, for example. A heterogeneous SMT for embedded processors should meet demands such as these, which are not common in conventional SMT.
Therefore, we propose a heterogeneous SMT processor named OROCHI, which can simultaneously execute both a conventional instruction set and a VLIW instruction set. By unifying the back-end pipeline, which includes a load/store unit, the processors based on different architectures share execution units and a data cache. Each processor has the opportunity to use more cache area while the other processor does not need a large cache area. First, we propose a novel QoS-aware instruction scheduling mechanism with a VLIW queue that is completely different from traditional superscalar processors. It schedules VLIW instructions directly and also transforms conventional instructions efficiently: conventional instructions are decomposed into simple instructions and inserted into the empty slots of the VLIW queue. Second, we adopt a cache miss prediction mechanism incorporated in the branch predictor and a selective instruction flush mechanism in the VLIW queue, which are more effective than previous QoS control mechanisms that use an OS scheduler [3] or a hardware approach such as dynamic cache partitioning [4].
The rest of this paper is organized as follows. Section II reviews previous work on QoS. Section III presents the microarchitecture of OROCHI. Section IV describes its evaluation. Finally, Section V concludes the paper and describes future work.
II. PREVIOUS WORK ON QOS (QUALITY OF SERVICE)
To sustain QoS, several methods have been proposed. These approaches fall into two categories: software approaches and hardware approaches.
The most traditional and common software approach is scheduling by the OS. However, reducing the execution time of the other applications is the only way to improve the performance of the QoS-aware applications. By monitoring performance counters, IPC, etc., an OS can sustain fairness to some extent [3]. However, the performance of each application tends to be degraded, so it is hard to sustain QoS through the scheduler alone.
Hardware approaches are more powerful than OS approaches. One is cache partitioning [5], which divides the cache memory to provide a dedicated cache for each application. The dedicated cache is effective in alleviating interference among applications. However, each cache thus becomes smaller than the total cache, and as a result performance is degraded to an unacceptable degree [6]. To alleviate this problem, dynamic cache partitioning [4], which adjusts the boundaries of the cache, and virtual private caches [7], which control cache bandwidth, have been proposed. Unfortunately, their effectiveness is also limited.
A central problem in QoS resides in pipeline stalls due to unexpected cache misses, so a cache miss prediction mechanism shows promise for sustaining QoS. For instance, the Compaq Alpha 21264 [8] has a cache hit/miss predictor for its speculative issue mechanism. If the cache access is predicted to hit, the Alpha 21264 speculatively issues the instructions that depend on the preceding load instruction in order to minimize load-use latency. If the speculation fails, all integer pipelines are rewound, the pipelines are restarted, and the instructions are reissued. The cache hit/miss predictor is therefore very important for the Alpha 21264 to reduce this overhead.
Another approach is a selective instruction flush mechanism [9]. When a cache miss occurs on some thread, the instructions that depend on the load instruction are removed from the instruction window to avoid unnecessary resource occupation. After the cache is filled, the removed instructions are re-inserted into the instruction window.
III. MICROARCHITECTURE OF OROCHI
Recent embedded devices that deal with high-quality multimedia content have a conventional processor (scalar processor) and a media processor (e.g. VLIW). The conventional processor usually executes OS code and miscellaneous low-ILP applications. To minimize development time, exploiting conventional processors is crucial, as many legacy codes and libraries are required to complete the system. On the other hand, a media processor is required to accelerate the media processing. There is much data parallelism in multimedia applications, so typical media processors employ an effective instruction set such as VLIW, SIMD, etc. that can easily exploit data parallelism at low hardware cost. We considered that the legacy codes can be transformed to fit a VLIW structure in order to reduce the footprint of the total system. We evaluated a heterogeneous SMT comprising the ARM [10] architecture, one of the most popular embedded processors with a de-facto standard OS, and the FR-V [11] architecture, another popular embedded processor in the image processing field.
The FR550 is an eight-issue FR-V architecture processor. It can simultaneously issue four integer instructions together with four floating-point or media instructions. The media instructions support saturating operations, multiply-and-accumulate, SIMD, and so on. Branch and load/store instructions are classified as integer instructions; the FR550 can issue two branch instructions simultaneously to support three-way branches, and also two load/store instructions.
Figure 1 outlines the concept of OROCHI, with a VLIW back-end pipeline based on the FR550. The most important difference from a conventional pipeline is a VLIW queue that holds two different instruction sets simultaneously. The key point of this structure is that some empty slots always exist in the queue, because the number of function units of a VLIW processor is usually larger than the maximum number of instructions in one VLIW word. As a result, even if the VLIW instruction stream executes high-performance multimedia applications that occupy almost all of the instruction slots, enough empty slots remain for executing the legacy code of ARM applications or the OS. Therefore, we considered it possible to integrate the two different types of processors effectively without performance degradation.
Fig. 1. Pipeline of OROCHI (figure not reproduced; the front-end stages shown include IA, IF with the I1 cache, and BP)

In detail, the back-end pipeline is comprised of the instruction queue, the register file, the execution units, the data cache, and the I/O interface. A different type of instruction set is translated to fit the back-end pipeline, and several front-end pipelines are connected to the instruction queue. Thus, various kinds of processors can be united at small cost.
A. Outline of the pipeline
OROCHI has two front-end pipelines. Each front-end has an instruction address generator (IA), an instruction fetch stage (IF) with a cache (I1), a branch predictor (BP) that includes the load-use miss predictor described later, and a decoder (ARM-D and HOST-D, performing instruction decomposition similar to the Intel P6 [12] or NetBurst [13] architectures, and VLIW-D). Additionally, the ARM front-end has a rename stage (Rename) for out-of-order execution. The decoded instructions from VLIW-D are directly enqueued into the left-most portion of the queue. Meanwhile, the renamed instructions from Rename are scheduled into the queue based on data dependencies and the empty slots (Schedule). The detailed mechanism of this scheduling is described later.
The instructions in the queue, which carry architecture flags to distinguish between ARM and FR-V instructions, are simultaneously shifted toward the execution units when the instructions in the right-most column have all been issued. The issue mechanism is very similar to that of popular VLIW architectures. Obviously, it is very important to schedule instructions so that interlocks in the instruction queue seldom occur, because a partial data dependency interlocks the whole queue.
The back-end pipeline is based on VLIW, as mentioned, and includes three integer units with shift and partial multiplication functions (ALU), one load/store unit (OP1), one branch unit (BRC), and four media units dedicated to FR-V instruction streams (MEDIA)¹. These function units are a subset of the FR550 processor. All function units except the MEDIA units are shared by ARM and FR-V. The back-end pipeline also has a general register file (GRF) with eight read ports and five write ports, and a media register file (MRF) with eight read ports and four write ports. Since renaming is not necessary for the in-order execution of FR-V, only a logical register file is required for FR-V. Even though the logical register spaces of ARM and FR-V are separate, the register file is shared, so the register file becomes large; however, the numbers of read and write ports are not increased. Since OROCHI does not have a register transfer instruction between the general and media registers, the media register file is independent from the general register file.
¹Floating point units are not included. ARM and FR-V use a soft-float library instead.
As for ARM instructions, the results are written to the reorder buffer out of order (WR) and then completed in order in the following retire stage (RETI). As for FR-V instructions, the results are written to the architecture registers and are likewise completed in order.
When a branch prediction misses in one thread, the related instructions are flushed from the front-end and the instruction queue, while the other thread keeps executing its instruction stream.
B. QoS-Aware Instruction Scheduling
In conventional SMT, the requirements for QoS are not so strong, because fairness between processes is the most important issue for the system. In the embedded-system area, by contrast, special considerations are required to maintain QoS for certain multimedia applications.
Under typical usage of OROCHI, the processor simultaneously executes both a multimedia processing thread written in the VLIW instruction set and an OS thread written in the conventional instruction set. On the multimedia processing side, there are many deadlines: the processor has to guarantee completion of each task before its deadline to meet the media QoS requirement.
Fig. 2. Detailed Structure of OROCHI (figure not reproduced)
The most extreme method to maintain the QoS of multimedia applications running on FR-V is the complete shutdown of ARM instruction streams. However, this is not acceptable from the point of view of real-time interrupt handling.
The compiler for a VLIW schedules instructions statically. If the compiler cannot find an instruction to fill a slot, the empty slot is left as a NOP, because the VLIW does not schedule instructions dynamically. Even if ARM instructions are inserted into such empty slots, no performance drop occurs, on the condition that the ARM instructions do not interfere with the instruction streams of the FR-V applications. The most reasonable technique to sustain the performance of FR-V is therefore to provide enough slots for FR-V and to schedule ARM instructions into the unused slots.
Figure 2 describes the structure of instruction scheduling; in this figure, the media and branch units are omitted for simplicity. First, if the left-most portion of the queue is empty, FR-V instructions are enqueued into it. Then ARM instructions are inserted into the queue. To find a suitable empty slot, the scheduler compares the destination register numbers of the instructions already scheduled in the queue against the source register numbers of the instruction to be scheduled, and then inserts it into a suitable slot as near to the corresponding execution unit as possible. This scheduling mechanism allows out-of-order execution together with the preceding rename stage, and achieves performance comparable to out-of-order superscalar processors. After that, the mixed instructions in the queue are issued to the common back-end unit in the same manner as VLIW.
C. Issue Instructions
In the dispatch stage, the VLIW hardware dispatches all of the instructions in the right-most portion of the queue. If there is an instruction that cannot be issued due to an unresolved data dependency, such as a load-use with a possibility of a cache miss, the following pipeline stages stall. When an L1 data cache miss occurs, it stalls not only the dependent instructions but also the other instructions in the same line. Such a simple structure results in lower complexity than superscalars, which incorporate complicated wakeup and select logic. In exchange, performance drops seriously when one of the instructions waits for data produced by previously dispatched instructions, and the major source of such stalls is data cache misses. In traditional instruction scheduling, in order to greatly reduce execution latency, the instructions that require load data are scheduled as if no cache miss would occur. OROCHI, by contrast, should insert ARM instructions without interfering with FR-V, and because it has only a limited instruction scope, there is a high probability of pipeline hazards due to L1 data cache misses. Basically, OROCHI maintains the QoS of the FR-V application by scheduling ARM instructions carefully. The key ideas of the mechanism are cache miss prediction and selective instruction flush, described in the next section.
D. QoS Control with Cache Miss Prediction and Selective Instruction Flush
To alleviate this pipeline stall problem, we propose a cache hit/miss predictor and a selective instruction flush mechanism.
In general, a cache miss predictor indicates whether the target cache access will hit or miss. However, OROCHI has to control not only where the dependent instruction should be scheduled in the queue but also when it should be scheduled. For instance, when a cache miss is predicted, instructions that depend on the load data should be scheduled apart from the load instruction. If no suitable free slot can be found in the instruction window because a long delay is predicted, the instruction's scheduling should be delayed. Such a mechanism has the potential to avoid pipeline stalls due to cache misses, if it can learn cache behavior efficiently.
Conversely, if the prediction is incorrect, the processor cannot avoid a pipeline stall. To alleviate this case, we propose an additional selective instruction flush mechanism. When an ARM load instruction results in a cache miss, all ARM instructions, including the load instruction, are purged from the instruction window. Note that the cache fill request is not canceled. Since all instructions carry an architecture flag, it is easy to find the ARM instructions. After that, the load instruction and the following instructions are scheduled again. With this mechanism, the pipeline stall is eliminated and the FR-V instructions are executed without interference from ARM.
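Because every queue entry carries an architecture flag, the flush itself is a simple sweep. A minimal sketch (invented structure, not the RTL):

```python
# Illustrative sketch of the selective flush: on an ARM load miss, every
# ARM-flagged instruction is removed from the VLIW queue, while FR-V
# instructions stay in place and the cache fill request proceeds untouched.

def selective_flush(queue):
    """Purge all ARM instructions from the queue; return them in queue order
    so the front-end can re-schedule the load and its followers later."""
    flushed = []
    for column in queue:
        for i, ins in enumerate(column):
            if ins is not None and ins["arch"] == "ARM":
                flushed.append(ins)
                column[i] = None   # the freed slot is available again
    return flushed
```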
Figure 3 outlines these techniques. In this figure, LUMP indicates the Load-Use Miss Predictor, which predicts whether an instruction will cause a cache miss or not. The LUMP is implemented in the branch predictor: at this stage it is not yet known whether the instruction is a load or not, but sharing the table with the branch predictor minimizes the hardware cost. The additional information in the PHT of the gshare branch predictor is several bits that indicate the estimated delay cycles for scheduling. When a load instruction is scheduled, the scheduler controls the insertion point and the timing according to this prediction (1a). For example, when the corresponding counter indicates 3, three cycles of cache delay can be hidden; if the cache access hits instead, the three cycles become a penalty. When a load instruction is executed, the corresponding entry of the table is updated (1b): when the load misses in the cache, the corresponding counter is incremented, and vice versa. If the selective instruction flush mechanism is also enabled, a cache miss leads to flushing all of the ARM instructions, including the load instruction, from the VLIW queue (2).

Fig. 3. LUMP & Selective Flush (figure not reproduced)
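The counter update just described (a small saturating counter per PHT entry, incremented on a miss and decremented on a hit) can be sketched as follows; the 3-bit width and 8K entries follow Table I, while the function names and the exact index hash are invented for illustration:

```python
# Sketch of the Load-Use Miss Predictor (LUMP) counters, assuming the
# 3-bit, 8K-entry PHT of Table I shared with the gshare branch predictor.

PHT_ENTRIES = 8192
pht = [0] * PHT_ENTRIES          # 3-bit saturating counters, range 0..7

def pht_index(pc, global_history):
    # gshare-style hash of PC and global branch history (assumed form)
    return ((pc >> 2) ^ global_history) % PHT_ENTRIES

def predicted_delay(pc, ghist):
    """Estimated cycles to keep dependent instructions away from the load."""
    return pht[pht_index(pc, ghist)]

def update_on_execute(pc, ghist, missed):
    """Saturating update: increment on a cache miss, decrement on a hit."""
    i = pht_index(pc, ghist)
    if missed:
        pht[i] = min(pht[i] + 1, 7)
    else:
        pht[i] = max(pht[i] - 1, 0)
```

With this shape, a load that repeatedly misses accumulates a larger predicted delay, so its consumers are scheduled further away; a streak of hits decays the counter back toward zero, avoiding the penalty mentioned above.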
IV. EVALUATION
We evaluate the multiple instruction set SMT processor OROCHI from the viewpoints of IPC and feasibility. First, the performance of the VLIW queue is evaluated in comparison with an out-of-order superscalar processor. Second, SMT performance with both ARM and FR-V applications is evaluated. Finally, the QoS features are measured. Table I shows the basic parameters of OROCHI.
A. VLIW Queue
We first evaluate the performance of the VLIW queue in comparison with a superscalar processor, using an RTL-level simulator. We also produced an ASIC design (0.25 µm technology) to evaluate the delay and the area.
For this evaluation, we also designed an ARM superscalar processor with a centralized instruction window as a baseline (ARMSS). Figure 4 outlines the baseline processor. The fetch, decode, decompose, and back-end units are the same as OROCHI's, but ARMSS has a centralized instruction window in order to support dynamic out-of-order execution, together with complicated Wakeup-Select logic. The Wakeup-Select logic searches for instructions that are ready to be issued (Wakeup) and decides which of the candidates are issued (Select), all within one cycle; in Figure 4 this appears as an additional large selector in the Select/Read stage. We compare IPC using several programs from MiBench [14] running on ARM.

TABLE I
EVALUATION PARAMETERS

Cache miss predictor    PHT: additional 3 bit × 8K entries (integrated in the branch predictor)
Branch predictor        PHT: 2 bit × 8K entries (gshare)
Return address stack    8 entries
Physical registers      32 entries
Store buffer            8 entries
Cache line size         64 byte
ARM I1 cache            4-way, 16 KB, miss latency 8 cycles
FR-V I1 cache           4-way, 16 KB, miss latency 8 cycles
D1 cache                4-way, 32 KB, miss latency 8 cycles
Unified L2 cache        4-way, 2 MB, miss latency 40 cycles
VLIW queue depth        4

TABLE II
PERFORMANCE OF ARM SUPERSCALAR (ARMSS) AND OROCHI

          IPC     delay [ns]   Freq [MHz]   IPC×Freq [MIPS]
ARMSS     1.335   13.51        74.0         98.8 (1.00)
OROCHI    1.331   8.54         117.1        155.9 (1.58)
Table II shows the IPCs, the circuit delays, and the overall performance; Table III shows the areas. From these results, OROCHI outperforms ARMSS. The comparison of IPCs shows that ARMSS gains only 0.3% over OROCHI. The comparison of delays shows that OROCHI is faster than ARMSS by 36.8%, thanks to its simple instruction issue mechanism. As a result, the overall performance of OROCHI, expressed as the product of IPC and frequency, is superior to that of ARMSS by 57.8%. The comparison of the areas shows that the sizes of the two implementations are almost the same. Thus, OROCHI proves to be as efficient an implementation as an out-of-order design.

Fig. 4. (figure not reproduced: the ARMSS baseline pipeline with its centralized instruction window and Wakeup-Select logic)

TABLE III
AREA OF ARM SUPERSCALAR (ARMSS) AND OROCHI

                     Relative cell area
ARMSS                1.000
OROCHI (ARM only)    1.016
B. Focus on IPC
The overall performance of the SMT is evaluated using an RTL-level simulator capable of running the real µClinux/ARM [15] with no MMU. Some benchmarks from MiBench are compiled as ARM binaries or FR-V binaries respectively and run simultaneously with the 'small' datasets under the control of the OS. We select some irregular applications (e.g. bitcount and dijkstra) for ARM and 13 media applications for FR-V. The average IPC is measured from the point at which both programs start to the point at which the FR-V program terminates, while the ARM program is executed repeatedly.
Figures 5 and 6 show the average IPCs, including a baseline comprised of the total IPC of separated execution of ARM and FR-V, where a heterogeneous multi-core configuration is assumed. The leftmost bars of each result show the baseline (oracle) IPCs, which correspond to the simple summation of the IPCs of ARM and FR-V. The other bars show the IPCs of SMT execution. Note that the IPC of ARM includes the execution of OS code.
With ARM bitcount, the IPC of FR-V achieves 98.3% of the ideal performance and the IPC of ARM reaches 73.4%. In the same manner, with ARM dijkstra, the IPC of FR-V achieves 87.4% of the ideal and the IPC of ARM reaches 76.4%. These results clearly show that OROCHI can successfully unite the two different types of processors in a single pipeline.
Fig. 5. ARM bitcount & FRV-MiBench (chart not reproduced)

Fig. 6. ARM dijkstra & FRV-MiBench (chart not reproduced)
Considering the difference in performance between ARM bitcount and ARM dijkstra, a difference in data cache miss ratio is observed: the miss ratio of the first-level data cache in bitcount is only 0.7%, whereas in dijkstra it is 5.5%. This difference in memory pressure is considered the major reason for the phenomenon.
C. Focus on QoS
In contrast to the assumption in the previous section, it is easily imagined that ARM programs with high memory pressure interfere with the performance of FR-V. To alleviate this case, we proposed the two key hardware mechanisms mentioned before, the cache miss predictor and the selective flush, which are more effective than the software approaches.
For comparison, an OS-based QoS mechanism inspired by previous work [3] is also evaluated. In this mechanism, the process scheduler in the OS controls the priority of the ARM programs so that the FR-V programs maintain their performance to some extent.
Figure 7 shows the results. Oracle, Base, LUMP, Flush, LUMP+Flush, and OS Sched. correspond to the oracle performance, no additional mechanism, load-use miss prediction (LUMP), selective flush, both LUMP and Flush, and the OS scheduler, respectively. TOTAL IPC shows the sum of the ARM IPC and the FR-V IPC. The Oracle and Base results are the same as those in Figure 6.
Fig. 7. QoS Assurance of FR-V (w/ ARM dijkstra)
With LUMP or Flush, the performance of FR-V (FR-V IPC) increases
from 87.4% (Base) to 90.1% and 92.5% on average, respectively.
Moreover, when both the LUMP and Flush techniques are applied
(LUMP+Flush), it reaches 92.8% on average, while the decrease in ARM
performance (ARM IPC) corresponds to the increase in FR-V
performance. Consequently, the total performance (TOTAL IPC) does
not decrease at all.
Using the OS scheduler (OS Sched.), the performance of FR-V also
reaches 92.5% on average. However, to achieve this performance, the
OS scheduler limits ARM execution time by 60.0%. The performance of
ARM is therefore significantly decreased, by 60.0% of Base, and the
total performance is only 82.9% of that achieved with our hardware
mechanisms. These results show that LUMP and the selective flush
mechanism are effective for sustaining QoS. In particular, the
latter increases the IPC of FR-V by 5.1%. Note that the total
performance is not decreased.
D. Feasibility Study
To evaluate the effectiveness of unification quantitatively, we
designed OROCHI in an ASIC process (0.25 µm rule). Table IV compares
several configurations. OROCHI indicates the entire area of OROCHI;
OROCHI (FRV only) and OROCHI (ARM only) indicate OROCHI without the
ARM front-end and without the FRV front-end, respectively. The
differences between these results correspond to the sizes of the ARM
front-end (33.2%) and the FR-V front-end (14.1%).

TABLE IV
Configuration        Relative cell area
OROCHI               1.000
OROCHI (FRV only)    0.668
OROCHI (ARM only)    0.859
ARM front-end        0.332
FRV front-end        0.141
Common back-end      0.527

The ARM front-end is twice as big as the FRV front-end due to its
renaming and out-of-order execution mechanisms. Note, however, that
the difference in area is exaggerated because of the small caches
(L2 is not included) and the lack of floating-point units, as
mentioned. If we built a heterogeneous multicore from this front-end
and back-end, its size would be 152.7% due to the redundant
back-end; thus, OROCHI reduces the chip area by 34.5%. Assuming the
same semiconductor technology, OROCHI is comparable in size to a
single SPE of a Cell Broadband Engine.
V. CONCLUSIONS AND FUTURE WORK
In this paper, we proposed OROCHI, a heterogeneous SMT processor
that can execute both a conventional instruction set and a VLIW
instruction set simultaneously.
By unifying the back-end pipeline, which includes a load/store unit,
processors based on different architectures can share execution
units and a data cache. Each processor has the opportunity to use
more cache area while the other processor does not need a large
cache area. First, we proposed a novel QoS-aware instruction
scheduling mechanism with a VLIW queue. It schedules VLIW
instructions directly and also transforms conventional instructions
efficiently. The latter instructions are decomposed and inserted
into the empty slots of the VLIW queue. Second, we adopted a cache
miss prediction mechanism and a selective instruction flush
mechanism in the VLIW queue that are more effective than OS-based
QoS.
We evaluated the performance with an RTL-level simulator running
MiBench and an OS. The results show that the microarchitecture can
achieve 98.3% of the ideal FRV performance and 73.4% of the ideal
ARM performance simultaneously when executing a light ARM process.
Even when it executes a heavy ARM process, QoS is maintained at
92.8% of FRV performance. Compared to a well-known QoS mechanism
controlled by a process scheduler in the OS, this microarchitecture
can increase the total IPC by 20.7%. We also evaluated the chip area
by designing the microarchitecture on an ASIC. The result shows that
it can successfully share the back-end, which accounts for 52.7% of
the chip area. As a result, the microarchitecture can reduce the
total chip area by 34.5% compared to a well-known separate
multicore implementation.
As future work, we will measure the real power consumption of OROCHI
to evaluate the reduction of power consumption, including static
power leakage.
ACKNOWLEDGMENT
This research is joint research with the Semiconductor Technology
Academic Research Center and partially supported by the Ministry of
Education, Science, Sports and Culture, Grant-in-Aid for Scientific
Research (B), 19300012, 2006.
REFERENCES
[1] D. Pham et al., "The design and implementation of a first-generation Cell processor," in ISSCC, 2005, pp. 184–592.
[2] J. A. Brown and D. M. Tullsen, "The shared-thread multiprocessor," in ICS, 2008, pp. 73–82.
[3] A. Fedorova, M. Seltzer, and M. D. Smith, "Improving performance isolation on chip multiprocessors via an operating system scheduler," in PACT, 2007, pp. 25–38.
[4] J. Chang and G. S. Sohi, "Cooperative cache partitioning for chip multiprocessors," in ICS, 2007, pp. 242–252.
[5] S. E. Raasch and S. K. Reinhardt, "The impact of resource partitioning on SMT processors," in PACT, 2003, pp. 15–25.
[6] R. R. Iyer et al., "QoS policies and architecture for cache/memory in CMP platforms," in SIGMETRICS, 2007, pp. 25–36.
[7] K. J. Nesbit, J. Laudon, and J. E. Smith, "Virtual private caches," in ISCA, 2007, pp. 57–68.
[8] R. E. Kessler, "The Alpha 21264 microprocessor," IEEE Micro, vol. 19, no. 2, pp. 24–36, 1999.
[9] A. R. Lebeck et al., "A large, fast instruction window for tolerating cache misses," in ISCA, 2002, pp. 59–70.
[10] ARM Architecture Reference Manual, ARM DDI0100E, ARM Limited, 2000.
[11] FR550 Series Instruction Set Manual Ver. 1.1, FUJITSU Limited, 2002.
[12] L. Gwennap, "Intel's P6 uses decoupled superscalar design," Microprocessor Report, vol. 9, no. 2, pp. 9–15, 1995.
[13] G. Hinton et al., "The microarchitecture of the Pentium 4 processor," Intel Technology Journal, Q1, 2001.
[14] M. Guthaus et al., "MiBench: A free, commercially representative embedded benchmark suite," in IEEE 4th Annual International Workshop on Workload Characterization, 2001.
[15] "µClinux," http://www.uclinux.org/.
Experiences with Numerical Codes on the Cell Broadband Engine
Architecture
Markus Stürmer, Daniel Ritter, Harald Köstler, and Ulrich Rüde
System Simulation Group
Department of Computer Science, University Erlangen-Nuremberg
Cauerstraße 6, 91058 Erlangen
[email protected]
Abstract
Many numerical computations in science and engineering require high
memory bandwidth and computational power. The Cell Broadband Engine
Architecture (CBEA), a heterogeneous multicore architecture,
promises both. We evaluated the potential of the CBEA for numerical
codes in the areas of image processing, computational fluid
dynamics, and molecular dynamics. We present results and derive the
strengths of and challenges in using this novel architecture.
Keywords: CBEA, Cell processor, performance optimization, image
processing, computational fluid dynamics, molecular dynamics
1. Introduction
Multicore architectures are the current trend to serve the
insatiable demand for computational power in science, engineering,
economy, and gaming. In contrast to other chip designers that put
multiple, basically identical cores on a chip, STI1 took a different
approach with their Cell Broadband Engine Architecture (CBEA), which
promises outstanding performance by establishing a heterogeneous
design, whose key concepts are outlined in Sect. 2. The first
machine to break the Petaflop barrier in Linpack was built of 12,960
PowerXCell 8i processors, the latest implementation of the CBEA, and
6,480 AMD Opteron processors at Los Alamos National Laboratory.
To explore the potential of this novel architecture for numerical
applications, we describe performance-optimized implementations on
the CBEA for applications in image processing (Sect. 3),
computational fluid dynamics
1 Sony, Toshiba, and IBM
(Sect. 4), and molecular dynamics (Sect. 5) before recapitulating
the special features of the architecture in Sect. 6.
2. Architectural overview
The first implementation of the CBEA, the so-called Cell Broadband
Engine (Cell/BE), is used e. g. in the Sony Playstation™ 3 game
console and IBM's QS20 and QS21 blades. Its organization is depicted
in Fig. 1 [5, 6]: The backbone of the chip is a fast ring bus, the
Element Interconnect Bus (EIB), connecting all units on the chip and
providing a throughput of up to 204.8 GB/s in total when running at
3.2 GHz. A PowerPC-based general purpose core, the Power Processor
Element (PPE), is primarily used to run the operating system and
control execution, but has only moderate performance compared with
other general purpose cores. The Memory Interface Controller (MIC)
can deliver data with up to 25.6 GB/s from Rambus XDR memory, and
the Broadband Engine Interface (BEI) provides fast access to I/O
devices or a coherent connection to other Cell processors. The
computational power resides in eight Synergistic Processor Elements
(SPEs), simple but very powerful co-processors consisting of three
components: Synergistic Execution Unit (SXU), Local Storage (LS),
and Memory Flow Controller (MFC).
The SXU is a custom Single Instruction Multiple Data (SIMD) only
vector engine with a set of 128 128-bit-wide registers and two
pipelines. It operates on 256 kB of its own LS, a very fast,
low-latency memory. SXU and LS constitute the Synergistic Processor
Unit (SPU), which has a dedicated interface unit connecting it to
the outside world: the primary use of the MFC is to asynchronously
copy data between the LS and main memory or the LS of other SPEs
using Direct Memory Access (DMA). It also provides communication
channels to the PPE or other SPEs and is utilized by the PPE to
control execution of the associated SPU. Each SPE can be seen as a
very simple computer performing its own program, but dependent on
and controlled by the PPE.
The Cell/BE is able to perform 204.8 GFlop/s using fused
multiply-adds in single precision (not counting the abilities of the
PPE), but is limited regarding double precision. Only six SPEs are
available under Linux running as a guest system on the Sony
Playstation™ 3, which reduces the maximum performance there
accordingly to 153.6 GFlop/s. The newer PowerXCell 8i [7], used in
IBM's QS22 blades, differs from the older Cell/BE by SPEs with
higher performance in double precision (12.8 instead of 1.8 GFlop/s
each) and a converter that allows connecting DDR2 memory to the MIC.
Figure 1. Schematic view of the STI Cell Broadband Engine.
While standard PowerPC software and compilers can be executed on the
PPE's computation unit (the PowerPC Processor Unit, PPU), software
must be adapted to take advantage of the SPEs, whose SXUs use their
own instruction set. The basic approach to writing CBEA-enhanced
software is to separately implement the parts running on the PPE and
the SPEs, where libraries and language extensions help in issuing
and waiting for DMA transfers, and in the communication and
synchronization between the different agents. From a software
perspective, a program running on the PPU first acquires an SPE and
loads a code image to its LS. To actually start the program on the
SPE, a system call is used, which does not return until the SPE code
suspends execution or terminates.
There are several general or Cell-specific approaches to ease the
creation of heterogeneous parallel software, like IBM's Accelerated
Library Framework (ALF) and Data Communication and Synchronization
(DaCS) library, Cell Superscalar (CellSs) by the Barcelona
Supercomputing Center, the RapidMind Multi-Core Development
Platform, or Mercury's Multicore Plus SDK, to mention only some of
them.
3. Image processing
Image processing is one of the applications for which the Cell/BE is
especially suitable. Images naturally have regular data structures
and are processed using regular memory accesses, so that data can be
transferred easily by DMA. Additionally, single precision is usually
sufficient for image processing tasks. Besides the traditional
techniques for image and video compression based e. g. on wavelets
and Fourier transforms, methods using partial differential equations
(PDEs) have been developed. These methods have the potential for
providing high quality; however, they are usually very compute
intensive.
The PDE-based video codec PDEVC [10] is conceptually very simple.
For each picture, typically 10–15% of the pixels of an image are
selected and stored. All remaining pixels are discarded and must
therefore be reconstructed in the decoding stage. We will not
discuss the algorithms for selecting the so-called landmark pixels,
but will rather focus on the core algorithm used in the
reconstruction phase, when the landmarks c and the corresponding
pixel values u0 are given. Filling in the missing pixels is the
so-called inpainting problem [3], which is modeled by a partial
differential equation of the form

    c (u − u0) − (1 − c) div(Duσ ∇u) = 0,

where the diffusion tensor Duσ can be one of the three choices, in
order of increasing complexity:
• homogeneous diffusion (HD),
• nonlinear isotropic diffusion (NID),
• nonlinear anisotropic diffusion (NAD).
Examples of reconstructions are shown in Fig. 2. Homogeneous
diffusion has a tendency to smooth edges in the images, but leads to
the least costly algorithm. The nonlinear variants attempt to
preserve edges better by adjusting the diffusion tensor to the local
image features. The NAD regularizer is currently state of the art,
as it is best at preserving edges, but it is also the
computationally most expensive one.
The three color channels of an RGB image are encoded separately, and
an equation must be solved for each of them. Typically, a frame rate
of about 25 frames per second (FPS) is necessary to achieve smooth
real-time playback.
The PDEVC player is a typical multi-threaded application: one thread
interprets the video file and sets up the necessary data structures
in main memory. Multiple decompressor threads produce one video
frame at a time by solving the associated PDE(s) approximately.
Another thread is responsible for displaying. Two ring buffers are
necessary to synchronize the data flow.
10
Figure 2. Comparing the three different kinds of diffusion.
In the CBEA-optimized version of the player, the decompressor
threads off-load the numerical work to an associated SPE.
ω-red-black Gauss-Seidel (ωRBGS) solvers are used for the HD and NID
regularizers, and a damped Jacobi solver (JAC) for NAD. More complex
solvers, like the multigrid methods typically used for these types
of PDEs, give only small improvements due to the high density of
landmarks. JAC in particular is suitable for SIMD processing, but
care must be taken to preserve landmarks where known pixels are
given. This is achieved by first calculating a whole SIMD vector
containing four new single precision results, regardless of the
pixel types. The final result, which will be written back to the
Local Storage, is created by selecting from the previous and updated
values depending on a bit field describing the landmarks in the
current frame. The SPU ISA allows this to be performed very
efficiently. The kernels are implemented using intrinsics, because
the compiler failed to vectorize and unroll the loops automatically.
For the image sizes investigated, data from multiple image rows can
be held in an LS, so that blocking techniques reduce the DMA
transfers with main memory drastically. The ωRBGS solvers perform a
whole iteration per sweep, JAC two iterations, as described in [8].
Table 1 shows the frame rates achievable on a Sony Playstation™ 3
when all six available SPEs are used. These values do not include
the bandwidth and effort of the PPE for reading the file and setting
up the necessary data structures.
The ωRBGS implementations use the same approach as for preserving
landmarks to update only every second unknown, so internally twice
the computations need to be performed. Of the different types of
diffusion tensors, HD leads to a simple five-point stencil for the
Laplace operator with fixed coefficients and therefore has a low
computational density of 6 Flops per iteration and unknown. The NID
regularizer is also approximated by a five-point stencil, but the
coefficients are recomputed before each update, requiring 29 Flops
per update in total. The highest computational density occurs when
nonlinear anisotropic NAD tensors are used, since they result in a
nine-point stencil whose coefficients are updated every second
iteration, resulting in 39.5 Flops per update on average.
Only image data needs to be transferred (4 Byte per pixel and
color), since coefficients are calculated on-the-fly on the SPEs.
Decoding a single frame using one SPE generates about 120 MB of main
memory traffic per color frame for the examples in the table.
Table 1. Decompression speed of PDEVC, measured for a resolution of
320×240 pixels. 130 iterations of JAC for NAD or 65 ωRBGS iterations
for NID and HD with 10% landmarks were used to obtain comparable
times.

regularizer   FPS   bandwidth   computation
HD            101   12 GB/s     8.2 GFlop/s
NID            48   5.8 GB/s     18 GFlop/s
NAD            34   4.1 GB/s     36 GFlop/s
It can be seen that only the HD regularizer has extraordinary
bandwidth requirements. To interpret the GFlop rates correctly, it
should also be noted that many computations actually performed were
not accounted for: the NID kernel reaches an impressive 42 GFlop/s
internally, but most results are discarded due to the SIMD
vectorization of the ωRBGS method or because they are landmarks.
4. Computational fluid dynamics
Computational fluid dynamics (CFD) has a large number of
applications in science and engineering. Besides classical
Navier-Stokes solvers, lattice Boltzmann methods (LBM) have become
an interesting alternative. LBM use an equidistant grid of cells,
so-called lattice cells, that interact only with their direct
neighbors. However, both approaches are computationally very
expensive, and single computers often do not provide the necessary
performance to get results in reasonable time. LBM seem to be
especially suitable for the CBEA due to their simple access
patterns, higher computational density, and trivial parallelization
on shared memory machines.
cellbm [11] is a prototype LBM solver based on [4] that has been
designed especially for the CBEA and uses the common D3Q19 BGK
[1, 12] collision model. Its main motivation was to explore the
feasibility of blood flow simulation with the related problems,
e. g. the complex blood vessel structures, while using specialized
hardware efficiently. Single precision was used, since only Cell/BE
hardware with slow double precision was available during its
implementation.
The memory layout is key to efficiency and good performance. To save
memory, the whole domain is divided into so-called patches of
8 × 8 × 8 lattice cells in size, and only patches containing fluid
lattice cells are actually allocated. This allows efficient
processing of moderately irregular domains, while providing all
premises for good performance on the CBEA.
The layout allows for efficient data movement to and from the SPEs:
transfers of multiple 128-Byte blocks, corresponding to cache lines
of the PPE, with natural alignment in both main and local storage
result in optimal DMA bandwidth. Besides the patch itself, data from
the outer edges of possible neighbors needs to be fetched. To avoid
the related inefficient gather operations, a copy of these faces and
lines is reordered and stored contiguously while processing a patch,
and can be retrieved from its neighbors in the next time step
easily. Using two buffers for these copies, patches can be processed
independently and in any order, so the parallelization is trivial
and patches can be assigned dynamically to the SPEs using atomic
counters.
Patch data is stored in a structure-of-arrays manner, so all
computations can be done in a SIMD way with as many 16-Byte vectors
being naturally aligned as possible. SPEs must emulate scalar
operations by performing SIMD operations and combining the previous
SIMD vector with the SIMD vector containing the desired
modification, which makes them extraordinarily expensive.
Furthermore, loading or writing naturally aligned 16 B vectors are
the only memory operations to the LS the SPU supports natively; two
aligned loads and a so-called shuffle instruction that extracts the
relevant part are necessary to emulate an unaligned load.
Branches may lead to long branch miss penalties on the SXUs and are
inherently scalar, so the implementation avoids them wherever
possible. Conditional computations are vectorized by computing both
possible results in SIMD and creating a select mask according to the
condition. The resulting SIMD vector is obtained by combining the
different variants according to the mask using a select instruction.
The SPU ISA provides various operations for efficient mask
generation to support this.
Table 2 compares the performance of a serial lattice Boltzmann
implementation written in C running on various processor types and
our SIMD-optimized implementation, mainly written in SPU assembly
language, to demonstrate the importance of SIMDization on the SPUs.
The typical means of expressing LBM performance is the number of
lattice site updates or fluid lattice site updates per second (LUP/s
and FLUP/s). A single FLUP corresponds to 167 floating point
operations in the optimized SPU kernel. The codes run purely from
the CPUs' L2 caches or the SPU's LS, respectively. It can be seen
that the PPE cannot keep up with a modern server processor, but
performance on the SPU is worst due to the huge overhead of
performing scalar operations and branches. Advanced compilers may
vectorize simpler scalar algorithms, but they cannot employ SIMD in
the LBM program yet.
Table 2. Performance of a straightforward single precision LBM
implementation in C on an Intel Xeon 5160 at 3.0 GHz, a standard
3.2 GHz PPE and SPU, compared with the optimized SPU kernel, for an
8³ fluid lattice cells channel flow.

           straightforward C      optimized
CPU        Xeon    PPE    SPU     SPU
MFLUP/s    10.2    4.8    2.0     49.0
There are two approaches for coping with the cache-coherent
non-uniform memory access (ccNUMA) topology of the IBM QS blades,
which provide two Cell processors, each with an attached main
memory, and a fast interconnect between the processors. The simpler
approach is to allocate all data pagewise alternating between both
memory locations, so that an SPE on either CPU accesses memory
through the nearby and the remote memory bus. Distributing half of
the patches to each memory location and the proximate SPEs allows
optimizing for NUMA even better.
Table 3 shows the performance of the whole LBM solver on a
Playstation™ 3 and a QS20 blade with different SPE and CPU
utilization. Generally, it can be seen that well-optimized kernels
are able to saturate the memory bus with half of the SPEs available.
When looking at one or two SPEs running on a single CPU, the
Playstation™ 3 achieves slightly better performance. On the QS20,
the coherence protocol between the two CPUs leads to a lower
bandwidth achievable for a single SPE. Memory benchmarks have shown
that this is especially true for DMAs writing to main storage.
Both approaches for exploiting the NUMA architecture when utilizing
the second CPU and its memory bus can improve performance
significantly, with an efficiency of 79% and 93%, respectively. If
e. g. four Cell processors might
Table 3. Cell/BE MLUP/s performance for a 96³ channel flow.
MFLUP/s = MLUP/s · 94³/96³.

              PS3     QS20
CPUs          one     one    both          both
memory        local   local  interleaved   NUMA-aware
1 SPE/CPU     42      40     73            70
2 SPEs/CPU    81      79     129           136
3 SPEs/CPU    93      107    156           189
4 SPEs/CPU    94      110    166           204
6 SPEs/CPU    95      110    174           205
8 SPEs/CPU    N/A     109    173           200
be connected in the future, the efficiency of the simple approach
that distributes data blindly will decrease drastically. For
applications like the LB method, which are memory bound and whose
work can be distributed easily, manual management of data and its
memory locations is worthwhile anyway.
5. Molecular dynamics
Molecular dynamics (MD) is another field where the outstanding
numerical performance of the CBEA can be of use. One possibility to
solve MD problems with a large number of particles and long-range
interactions between them effectively are grid-based methods, which
are explained in [9]. These methods require fast linear solvers,
e. g. multigrid methods. They can be parallelized on a shared memory
system with moderate effort, and its high floating-point performance
and bandwidth make the CBEA a highly interesting architecture for
this class of algorithms.
A common issue in MD is the solution of Poisson's equation on an
unbounded domain, i. e. with open boundary conditions. For 3D, this
problem can be written as

    ΔΦ(x) = f(x), x ∈ R³,
    with Φ(x) → 0 for ‖x‖ → ∞,

where supp(f) is a bounded subset of R³. For numerical treatment,
the equation is discretized, which leads to the following
formulation:

    Δh Φ(x) = f(x), x ∈ {x | x = h · z, z ∈ Z³},
    with Φ(x) → 0 for ‖x‖ → ∞,

with the discrete Laplace operator Δh and mesh size h. This equation
is still an infinite system, which prevents a direct numerical
solution. For that reason, the system is reduced to a finite one
using a stepwise coarsening and expanding grid hierarchy of l levels
(Gi, i = 1, . . . , l) as described in [2]. The expanding and
coarsening leads to the fact that the number of grid points is not
halved in each dimension from one level to the next, but decreases
more slowly (compare Table 4). The values of Φ on the boundary
points of the coarsest grid are calculated by direct evaluation of
the free-space solution,

    Φ(xδ) = −(1/4π) Σx∈supp(f) h³ f(x) / ‖x − xδ‖.
This evaluation is only sensible for a small number of boundary
points because of its high cost. The solution with a multigrid
method is supported by that hierarchical grid structure. From the
class of multigrid methods, the Fast Adaptive Composite Grid method
(FAC) is used, which restricts the original system equation and not
the residual equation.
The FAC was implemented using a Jacobi smoother for pre- and
postsmoothing, direct injection for restriction, and linear
interpolation for prolongation. The program was parallelized on the
Cell/BE using domain decomposition. To enhance the execution speed
of the code, several optimization techniques were applied:
• SIMDization of the computation kernels: all operations such as
restriction, smoothing, interpolation, and treatment of interfaces
use the SPE vector operations.
• Linewise processing of the data using double buffering: each line
is 128-Byte-aligned in the Local Storage and the main memory to
utilize full memory bandwidth.
• The interfaces between two grid levels need special consideration
and are treated using a ghost layer, which avoids accessing both
grids at the same time.
After a smoothing step and before the restriction is done, all
threads are synchronized to avoid data inconsistencies.
Tests were performed both on the Playstation™ 3 and on the IBM QS20
for different grid sizes. The memory requirements for some of those
are specified in Table 4. Since the results are very similar on the
Playstation™ 3 and the QS20, but the QS20 offers more opportunities
because of its bigger main memory and more SPEs, only the test runs
on the QS20 are considered here. The first tests were run using one
Cell processor only. As an example of the performance of the adapted
computation kernels, the runtime of the Jacobi smoother was
analyzed. This was done using exact in-code timing commands at each
synchronization event, i. e. after each iteration of the smoother.
The timing results for different numbers of
Table 4. Overview of the four finest grid sizes, total number of
levels, and memory requirements of the FAC method.

# grid        size in each dim. on level          memory
levels        1       2       3       4           [MB]
8             35      35      35      23          8
12            67      67      39      35          26
16            131     131     131     71          159
20            195     195     103     99          504
Table 5. Runtimes (in msecs) for one Jacobi iteration depending on
grid size and number of threads.

problem size   64³     128³    192³    256³
1 SPE          1.56    10.1    31.2    70.6
2 SPEs         0.78    5.03    15.6    35.3
3 SPEs         0.63    3.36    10.5    23.6
4 SPEs         0.46    2.55    8.14    17.9
5 SPEs         0.39    2.17    7.25    15.8
6 SPEs         0.34    1.96    6.28    13.8
7 SPEs         0.32    1.81    6.40    13.5
8 SPEs         0.25    1.78    6.08    13.8
unknowns are shown in Table 5.
The question of interest is whether the memory bandwidth or the
floating-point performance is the limiting factor in terms of speed.
The first can roughly be computed as

    Pmem = (size − 1)³ · 20 / time,

as 20 Byte have to be transferred per inner grid point, while the
latter is given as

    Pflop = (size − 1)³ · 10 / time,

since 10 numerical operations are executed per inner grid point.
Fig. 3 shows both measures for the previous test runs.
The performance of the Jacobi smoother is basically bound by the
memory bandwidth. For up to six SPE threads, the scaling of speed is
almost ideal; for seven and eight there is hardly any effect, since
the memory bus is already saturated. The highest measured value is
22.7 GiB/s.
Additionally, experiments were performed on the QS20 distributing
the threads to both processors with an interleaved memory strategy.
This strategy allocates memory pages alternating between the two
memory buses, so in theory twice the memory bandwidth of the default
strategy is possible. In practice an improvement of up to 26.8% is
gained, as shown in Table 6. The benefit of an advanced memory
strategy increases with the number of active SPEs, i. e. for future
setups with more processors, exploiting the NUMA architecture more
diligently will be crucial.
Figure 3. Floating-point performance and memory bandwidth of Jacobi
smoother on the QS20.
Table 6. Memory throughput of the Jacobi smoother for grid size 192³
when using one or both memory buses, in GiB/s.

memory strategy   one bus   interleaved   relative speedup
1 SPE             4.1       4.1           0%
2 SPEs            8.18      8.20          0.24%
3 SPEs            12.1      12.2          0.25%
4 SPEs            15.7      16.2          3.1%
5 SPEs            17.6      20.4          15.5%
6 SPEs            20.4      23.0          13.0%
7 SPEs            20.0      25.3          26.8%
8 SPEs            21.0      26.7          26.8%
6. Conclusions
We have demonstrated the potential of the CBEA for several
scientific applications and shown that bandwidth and computational
power near the theoretical peak performance are achievable. However,
considerable effort is necessary to accomplish that. The complexity
is only partially caused by specific features of this architecture.
Splitting a task into smaller subtasks and handling synchronization
and communication between multiple agents has become increasingly
important since the advent of multicore systems. Heterogeneous
architectures only increase complexity in that a subtask must fit
the abilities of the core type it is executed on.
SIMD is a concept that is very common today, as it is the most
efficient way to exploit wide buses and data-level parallelism
without much complicating the control logic. The SPU ISA
consequently makes SIMD the default case and adds a further penalty
to performing scalar operations. Similarly, data alignment
influences performance on all advanced platforms. Alignment of
scalar and SIMD data in memory is restricted on most platforms, or
results in decreased performance if not appropriate. However, the
discrepancy between performing well-aligned SIMD and badly aligned
scalar operations on an SPU is unmatched.
The concept of Local Storage, which is managed by copying data to
and from main memory via asynchronous DMAs, is perhaps the only
concept not found in common general purpose architectures at all. It
allows for covering long main memory latencies exceptionally well
without using increasingly complex out-of-order cores. On the
downside, exact knowledge of the working set and its management is
necessary, not to mention the complexity of distributed, parallel
modifications of it. An analogy on standard cache-based
architectures might be the necessary overview of the current working
set when using cache blocking techniques, but there it affects only
performance and is only relevant for hot spots.
The question remains how much performance can be preserved if one switches
to higher-level programming approaches to increase productivity. Since the
emphasis of all projects was on how much performance is feasible, this will
have to be examined in the future. There is no doubt that libraries and
frameworks can ease communication, data partitioning, and movement. But as
all general approaches rely on established high-level language compilers,
the problem of optimizing numerical kernels in computationally bound
applications can be expected to remain.
References
[1] P. Bhatnagar, E. Gross, and M. Krook. A Model for Collision Processes
in Gases. I. Small Amplitude Processes in Charged and Neutral
One-Component Systems. Phys. Rev., 94(3):511–525, 1954.
[2] M. Bolten. Hierarchical grid coarsening for the solution of the
Poisson equation in free space. Electronic Transactions on Numerical
Analysis, 29:70–80, 2008.
[3] I. Galić, J. Weickert, M. Welk, A. Bruhn, A. Belyaev, and H. Seidel.
Towards PDE-based image compression. In Proceedings of Variational,
Geometric, and Level Set Methods in Computer Vision, Lecture Notes in
Computer Science, pages 37–48. Springer-Verlag, Berlin, Heidelberg,
New York, 2005.
[4] J. Götz. Simulation of blood flow in aneurysms using the Lattice
Boltzmann method and an adapted data structure. Technical Report 06-6,
Department of Computer Science 10 (System Simulation),
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany, 2006.
[5] IBM. Cell Broadband Engine Architecture, Oct. 2007.
[6] IBM. Cell Broadband Engine Programming Tutorial, Oct. 2007.
[7] IBM. Cell BE Programming Handbook Including PowerXCell 8i, May 2008.
[8] M. Kowarschik. Data Locality Optimizations for Iterative Numerical
Algorithms and Cellular Automata on Hierarchical Memory Architectures.
PhD thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany,
June 2004. Advances in Simulation 13.
[9] M. Griebel, S. Knapek, G. Zumbusch, and A. Caglar. Numerische
Simulation in der Molekulardynamik. Springer, 2003.
[10] P. Münch and H. Köstler. Video coding using a variational approach
for decompression. Technical Report 07-1, Department of Computer
Science 10 (System Simulation), Friedrich-Alexander-Universität
Erlangen-Nürnberg, Germany, 2007.
[11] M. Stürmer, J. Götz, G. Richter, A. Dörfler, and U. Rüde. Fluid Flow
Simulation on the Cell Broadband Engine using the Lattice Boltzmann
Method. Accepted for publication in the proceedings of the Fourth
International Conference for Mesoscopic Methods in Engineering and
Science, 2007.
[12] S. Succi. The Lattice Boltzmann Equation – For Fluid Dynamics and
Beyond. Clarendon Press, 2001.
A Realtime Ray Casting System for Voxel Streams on the Cell
Broadband Engine
Valentin Fuetterling Fraunhofer ITWM
Email:
[email protected]
Abstract—In this paper we introduce a volume rendering system designed for
the Cell Broadband Engine that requires a minimum of only two voxel slices
at a time to perform image synthesis of a volume data set for a fixed
number of user-defined views. This allows rendering volume data in a
streaming fashion and makes it possible to overlap rendering with data
acquisition.
Imagine a screening line at the airport where luggage is examined with an
x-ray machine. As luggage passes through the scanner, multiple x-ray
samples are taken from which 2D voxel slices are derived. These finally
form a full volume data set that needs to be displayed for quick analysis.
Traditional software volume rendering systems are impractical for such a
task, as they require the volume data set to be fully available for image
synthesis and thus need to wait until the x-raying process has finished.
Our solution is better suited for the depicted situation and
related problems as it is able to perform time-critical rendering
in parallel with volume generation.
I. INTRODUCTION
Volume visualization requires information to be extracted from a 3D scalar
field to form a single color value that can be displayed. This mapping can
be performed by a maximum/average intensity projection or by evaluating the
Volume Rendering Integral [10], which in its discretized form can be
computed iteratively with the over operator [12]. In practice the 3D scalar
field is usually represented by a uniform grid that is sampled multiple
times in order to compute the Volume Rendering Integral or other mappings
for every pixel of a viewing plane. Methods that can be used for the
sampling process are described in section II. The sampling rate necessary
to achieve acceptable results is determined by the Nyquist-Shannon sampling
theorem [13]. A huge number of samples needs to be taken to visualize a
data set resolution of 512³ voxels or higher, which makes volume rendering
a compute-intensive task. Optimization strategies exist [2]; however, most
of them rely on a pre-process that requires the full volume data set to be
analyzed prior to rendering. Thus these strategies are not applicable to
volume data sets that are not fully existent at the beginning of the image
synthesis process.
Another reason that favors a brute-force approach is its constant runtime
characteristic for a fixed volume set resolution: execution time does not
depend on the actual volume data, which changes frequently. This is a
property often required in our targeted area of application. With a
brute-force solution, special-purpose hardware is easily designed and very
efficient, so it is commonly used in today's time-critical systems such as
the one depicted in the abstract. However, special-purpose hardware is
expensive and inflexible by nature.
We will show that our flexible software approach, tailored to the hardware
architecture of the Cell Broadband Engine (CBE), is capable of rendering an
emerging volume data set 'just in time' from arbitrary view directions and
thus delivers the necessary performance for real-time volume data
inspection.
II. VOLUME SAMPLING METHODS
For volume sampling, object-order, image-order, and hybrid methods exist.
Texture slicing is a popular object-order method for interactive volume
visualization. Voxel slices are mapped to polygons that are transformed by
projection and blended together in correct order to form the final image
[11]. By design this method produces low-quality images and requires three
sets of voxel slices, each orthogonal to one of the major axes.
A widely used hybrid method is the shear-warp algorithm. Voxel slices are
first sheared and projected onto an intermediate plane aligned with the
volume, which is finally warped onto the viewing plane [8]. The image
quality suffers due to the warp process, and three sets of voxel slices are
required for this technique as well.
As both methods introduced so far demand the full volume data set to be
available, they are obviously impractical for voxel streaming. An
image-order method that does not share this handicap and provides
high-quality images is ray casting. For each pixel of the view plane a ray
is cast into the volume and multiple samples are evaluated along the ray
[7]. As each ray can be processed independently, this algorithm is very
flexible. The streaming model we will introduce in the course of this paper
depends on a flexible and simple sampling method in order to be efficient,
so we decided on ray casting.
III. CELL BROADBAND ENGINE
The CBE comprises one Power Processing Element (PPE) and eight Synergistic
Processing Elements (SPEs), which are attached to the Element Interconnect
Bus (EIB) together with a Memory Interface Controller (MIC) that provides
Direct Memory Access (DMA). The PPE is comparable to an ordinary PowerPC
and can offload compute-intensive tasks to the SPEs. Each SPE features a
Memory Flow Controller (MFC) for data transfer and communication, a Local
Store (LS) which is 256 KB in size, and a unified register set with 128
registers.
Fig. 1. The two green slices form the slab a) that is currently
being sampled, the red slices have already been processed and are
no longer available, the yellow slice is being generated by the
scanning device and the slices in grey will be produced in the near
future. The arrows b) and c) indicate a sub-slice and a sub-slice
level respectively. The drawing d) symbolizes the object being
scanned.
Each register is 128 bit wide and has SIMD capabilities similar to
the AltiVec ISA of the PPE (for more general information on the CBE
see [4]).
Communication between the PPE and SPEs can be accomplished by a mailbox
mechanism provided by the MFC, which allocates an in-tray for each SPE to
which 32-bit messages can be written by the PPE or other SPEs. The in-trays
work like FIFO queues with a capacity of four messages. A SPE can check its
in-tray for new messages at any time. If no new messages are available, it
can stall until the next message arrives.
In order to process a chunk of data, a SPE must initiate an asynchronous
DMA transfer to fetch it from main memory into its LS. When more than one
contiguous chunk of data is required, an efficient DMA list transfer can be
initiated: the DMA controller uses a list to gather multiple chunks from
main memory into the LS. The list can also be used to scatter LS data back
to memory. Lists must be located in the SPE's LS, and each list element
must consist of 8 bytes providing the chunk's effective address in main
memory and its size in bytes (figure 2). A DMA transfer will always
transmit at least one cache line of 128 bytes. Thus bandwidth is maximized
by using addresses and transfer sizes that are a multiple of 128 bytes.
IV. STREAMING MODEL
The Streaming Model is based on the observation that, for rendering a
volume data set, all sampling positions for each ray remain constant if the
view and the volume data set resolution do not change. The actual volume
content, however, can be altered arbitrarily, as it does not affect the
sampling positions. We will refer to such a combination of a constant
volume set resolution and a number of constant views from which the volume
set is rendered as a configuration.
A. Voxel Streams

A Voxel Stream of a volume data set consecutively delivers packets of voxel
data that are ordered with respect to time, so that their relative
positions within the volume data set are known in advance. In practice one
packet amounts to one voxel slice, and the packet ordering in time is
equivalent to the slice ordering along a major axis of the volume data set.
From now on we will assume this axis to be the z-axis (figure 1).

Fig. 2. Data structures. One list element (8 bytes) can reference up to 255
ray packets (128 bytes each).

A voxel stream of a volume data set can easily be rendered using a related
configuration. For each ray, which voxel slices are required and when these
voxel slices will be available can be precomputed and used at runtime for
efficient sampling. The direction of a ray along the z-axis determines
whether it traverses the volume front-to-back or back-to-front. For both
cases compositing methods exist to compute the Volume Rendering Integral
along the ray [12].
Fig. 3. Illustration of the memory reference hierarchy. The list headers
point to contiguous blocks of list elements, which in turn reference
contiguous blocks of ray packets inside the ray buffer. The data structures
are described in figure 2. List headers are sent to the SPE as mails from
the PPE. A SPE can then issue two DMA commands, the first transmitting the
transfer-list elements and the second using the transfer-list to gather the
ray packets.
B. Implementation on the CBE
In this section we describe the implementation of the Streaming
Model for a single SPE. The extension of this algorithm for
parallel execution on multiple SPEs will be the topic of section
V.
1) Sampling: Assuming that two neighboring voxel slices are located in the
LS of a SPE, the set of rays that possess sample points within the slab
formed by the slices (figure 1) must be retrieved from main memory,
processed, and written back. Because the LS size is limited, this ray set
must be split into several subsets. A triple-buffer approach is necessary
to overlap the data transfer with computation: while one set of rays is
being processed, the next set is being fetched to the LS and results from
the previous set are written back to main memory. As the set of rays
associated with one slab of voxel slices can be precomputed, it is possible
to generate transfer lists for a given slab. Assisted by a transfer-list,
the DMA controller can automatically gather a ray set from main memory,
freeing the SPU for other tasks. The same transfer-list can be used to
scatter the ray set back to main memory after the computation has finished.
It should be noted here that the transfer lists also minimize bandwidth
requirements, as no redundant ray data has to be transmitted. Admittedly, a
transfer list needs to be fetched from main memory prior to execution;
however, this overhead is insignificant compared to the ray data size
(figure 2). In order to exploit the SPE's SIMD capabilities, rays can be
processed in packets of four. Each ray requires three direction components,
a traversal depth value, and four color components (RGBA) for blending
(figure 2). Using single-precision floating-point values, the size of a ray
packet amounts to 128 bytes, which matches exactly one cache line. Thus ray
packets that are discontinuously distributed in main memory do not decrease
bandwidth if they share the same set.
Until now we have assumed that two full voxel slices can reside in the LS
at the same time. For an appropriate volume resolution, however, the size
of the LS is far too small. For this reason voxel slices need to be
partitioned into sub-slices along one axis; we have chosen the y-axis for
the remainder of the paper. The partitioning of the volume data set into
multiple sub-slice levels is depicted in figure 1. Instead of tracing one
full voxel slab at once, the process is serialized into sub-slabs with
their associated sub-sets of rays. The execution order of the sub-slabs is
critical for ensuring correct results, as ray sub-sets are not disjoint in
most cases. An example is given in figure 4. The ray sub-sets for the
sub-slabs A, B, C, and D are shown. The sub-set of A is empty, so we do not
consider it any further. B, C, and D all contain one independent ray that
is not shared with any other sub-set (rays 6, 4, and 2 respectively). B and
C share ray 5, while D and C share ray 3. This sharing implies that C must
be processed prior to B and D in order to maintain correctness, because the
blending of the samples is not commutative¹. In contrast, rays 2, 4, and 6
can be processed in arbitrary order. As the arrows a) and b) indicate,
dependencies between two sub-sets only exist in one direction of the
y-axis. The y-coordinate of the view point (red dot) separates the
sub-slabs with potentially positive dependencies (C, D) from those with
potentially negative dependencies (C, B, A). Note that rays with a large
y-direction component can share more than two sub-sets. Care must be taken
to prevent read-after-write hazards for rays belonging to multiple sub-sets
that can arise during DMA transfers.

¹For the Volume Rendering Integral, blending is not commutative; for the
maximum/average intensity projection it is.
Multiple ray buffers offer the possibility to circumvent the strict
ordering rules for dependent ray sets and to efficiently eliminate
read-after-write hazards at the cost of higher memory consumption. Figure 3
illustrates a ray buffer that consists of all the ray packets for a given
configuration. If copies of this ray buffer are available for all sub-slice
levels, intermediate blending results can be computed for each level
independently of the others. The final compositing of these intermediate
results is described in section IV-B4. Multiple ray buffers are even more
attractive for the parallelized version of our algorithm (see section V).
Fig. 4. Dependencies between sub-slabs. The sub-slabs are denoted with
upper-case letters A–D. The red dot represents the view origin. The
sub-sets of active rays for the sub-slabs are shown in the list;
dependencies are marked with red boxes. Independent rays are 1, 2, 4, and
6, which pierce only one or no sub-slabs. Rays 3 and 5 are positive and
negative dependent along the y-axis, respectively. The arrows a) and b)
indicate the directions of dependence.
So far we have neglected the issue of synchronizing the rendering process
with volume acquisition. This is a fundamental requirement in applications
like the one depicted in the abstract. We use the mailbox mechanism of the
CBE to control the rendering process. When a new voxel slice has arrived in
main memory, the PPE sends a message to the SPE which contains the list
header (figure 2) that allows the SPE to fetch the correct voxel data and
ray data (figure 3) into its LS. Every time a SPE has finished processing a
ray sub-set, it queries its mail in-box for new jobs. If no mail is
available, it stalls until new work or a termination signal arrives. For a
better understanding of how the previously described algorithm is
implemented on the SPU side, see figure 5.
2) Multiple Views: An obvious approach to rendering multiple views of a
configuration simultaneously is to utilize one SPE for each view, as
described in section V-B. A different
Ptr transferListDataBuffer[3];
Ptr rayPacketDataBuffer[3];
Ptr subslabDataBuffer[2];

Var curMail, nextMail, curIdx, nextIdx, slabIdx;

while (!ExitSignal(curMail)) {
    nextMail = GetNextMail();
    StartTransferListDataGather(nextIdx);
    StartRayPacketDataGather(nextIdx);

    WaitForRayPacketDataGather(curIdx);
    for (i = 0; i < NumberOfRayPackets(curMail); i++)
        SampleRayPacket(curIdx, i);
}
Fig. 5. A closer look at the SPU kernel. An actual implementation
of this pseudo code can be found in appendix A. curIdx and nextIdx
are indices for the transfer list and ray packet triple buffers.
slabIdx is the index for the volume data double buffer. curMail and
nextMail contain the list headers received from the PPU. All DMA
transfers (except the subslab data gather) are issued with a fence
and a tag id equal to the index of the destination buffer.
technique takes advantage of the memory reference hierarchy (figure 3).
List headers for multiple views can easily be mixed without the SPE kernel
noticing (figure 6). All the information required on the SPE side is a list
of all view origins of a configuration, which can be indexed with the view
number contained in a given list header (figure 2). For the parallelized
version of our algorithm this approach allows for overlapping certain
stalls (section V-A).
3) Preprocessing: Preprocessing for a given configuration is
straightforward. For each sub-slab, all the ray packets of the different
views are tested for sample points within the sub-slab to find the valid
ray set. The ray packets in the ray set are grouped by contiguous main
memory addresses (figure 3), and each group is referenced by one list
element, or more if the group is larger than 16 KB. List elements of the
same sub-slab and the same view are combined to form a transfer list that
is referenced by a transfer list header (figure 2).
4) Image compositing: Ultimately an image in RGB format is required to be
displayed on a monitor. Mapping the ray packets' blending values (figure 2)
to pixel colors is straightforward: the red, green, and blue color
components need to be scaled, cast to integers, and stored into the
framebuffer. This task can be computed by the PPE or distributed among the
SPEs. If multiple ray buffers are used, the different blending values for
the same ray packet need to be composited first in the correct order. The
ray buffers of positive and negative dependent sub-slice levels demand
ordering along their respective direction of dependence, starting with the
sub-slice level that contains the y-origin of the given view (figure 4).

Fig. 6. The data flow of our algorithm. The scanning device writes a new
voxel slice into main memory (1) and notifies the PPE (2). The PPE
retrieves the next list header (3) and sends it to the SPE's MFC (4). The
SPU receives the list header (5) and initiates three DMA transfers (6). The
first transmits the transfer-list to the LS (8). The second uses the
transfer-list (9) to gather the ray set data into the LS (10). The third
moves the voxel data into the LS (7). When all transfers have completed,
the SPU loads the required data for the sampling process. After computation
it writes the results back to the LS (11). As soon as all ray packets have
been processed, another DMA transfer is initiated (6) that scatters the ray
set data back to main memory (12), assisted by the transfer-list (9).
Finally an RGB image is extracted from the ray packet data and sent to the
display (13).

At this point all components required for our rendering system have been
described to enable an implementation on a single SPE. A summary of the
data flow is given in figure 6. Note that for simplicity an unlimited LS
size is assumed, so that all required data fits into it at the same time.
In practice some of the depicted steps need to be subdivided into smaller
data packages. In the next section we will examine possibilities for
distributing our algorithm among multiple SPEs.
V. PARALLELIZATION
There are two basic approaches to parallelizing our algorithm introduced in
section IV. The fine-grained solution operates at sub-slab granularity,
where each SPE is assigned one sub-slice level. The coarse-grained model
ties one or more independent views to different SPEs.
A. Fine-grained Parallelization
The subdivision of voxel slices into sub-slices, which initially was
introduced to account for the limited LS size, now offers a convenient
approach for parallelization. T