University of Michigan, Electrical Engineering and Computer Science
Power-Efficient Medical Image Processing using PUMA
Ganesh Dasika, Kevin Fan¹, Scott Mahlke
University of Michigan, Advanced Computer Architecture Laboratory
¹ Parakinetics, Inc.
The Advent of the GPGPU
• Increasingly popular substrate for HPC
– Astrophysics
– Weather Prediction
– EDA
– Financial instrument pricing
– Medical Imaging
Advantages of GPGPUs
• High degree of parallelism
– Data-level
– Thread-level
• High bandwidth
• Commodity products
• Increasingly programmable
Disadvantages of GPGPUs
• Gap between computation and bandwidth
– 933 GFLOPS : 142 GB/s bandwidth (0.15 B of data per FLOP, ~26:1 Compute:Mem ratio)
• Very high power consumption
– Graphics-specific hardware
– Multiple thread contexts
– Large register files and memories
– Fully general datapath
Inefficiencies in all general-purpose architectures
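The two ratios quoted for the compute/bandwidth gap can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes 4-byte (32-bit) operands when converting bytes to words; the function names are illustrative only.

```python
# Compute/bandwidth gap for the figures quoted on the slide.
PEAK_GFLOPS = 933.0      # peak compute quoted on the slide
BANDWIDTH_GBPS = 142.0   # memory bandwidth quoted on the slide
BYTES_PER_WORD = 4       # assumption: 32-bit operands


def bytes_per_flop(gflops, gbps):
    """Bytes of memory traffic available per floating-point operation."""
    return gbps / gflops


def compute_to_mem_ratio(gflops, gbps, word_bytes=BYTES_PER_WORD):
    """Operations that must be performed per word fetched to stay compute-bound."""
    words_per_sec = gbps / word_bytes   # giga-words per second
    return gflops / words_per_sec


print(round(bytes_per_flop(PEAK_GFLOPS, BANDWIDTH_GBPS), 2))        # 0.15
print(round(compute_to_mem_ratio(PEAK_GFLOPS, BANDWIDTH_GBPS), 1))  # 26.3
```

A kernel that performs fewer than about 26 operations per 32-bit word it fetches cannot keep such a datapath busy.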
Programmability vs Efficiency?
[Figure: design space plotted as Flexibility versus Efficiency. Labeled points: General Purpose Processors; DSPs; Domain-specific Accelerators, GPGPUs; FPGAs; Loop Accelerators, ASICs. An open region marked "???" is annotated "Highly efficient, some programmability".]
Medical Image Reconstruction
• Compute-intensive loops
– 32-bit floating point code
– High data/bandwidth requirements
• Increased demand for portability, low power
• Much current research focuses on using GPGPUs for this domain
CT Image Reconstruction
• X-ray emitters and receptors on opposite sides of the patient
• Received X-ray intensity corresponds to tissue density
• Multiple scans (“slices”) taken around the patient are combined to reconstruct one 2D image
Projection & Sinogram
• Projection: all ray-sums in a direction
• Sinogram: all projections
[Figure: X-rays pass through the object f(x,y), drawn on x and y axes, producing the projection P(t) along the detector axis t; a second panel shows the resulting sinogram plotted against t.]
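The two definitions on this slide can be made concrete with a small sketch. It is not the authors' code: it uses a simple pixel-driven, nearest-bin projector, and the names `project`, `sinogram` and the 3×3 test image are assumptions introduced for illustration.

```python
import math


def project(image, theta):
    """Ray-sums P(t) of a square image f(x, y) for direction theta (radians).

    Each pixel adds its value to the detector bin nearest to
    t = x*cos(theta) + y*sin(theta), measured from the image center.
    Odd-sized images keep pixel centers on bin centers at 0 and 90 degrees.
    """
    n = len(image)
    c = (n - 1) / 2.0                         # image center
    half = int(math.ceil(c * math.sqrt(2)))   # bins on each side of the center
    p = [0.0] * (2 * half + 1)
    for row in range(n):
        for col in range(n):
            x = col - c
            y = c - row                       # row 0 is the top of the image
            t = x * math.cos(theta) + y * math.sin(theta)
            p[int(math.floor(t + half + 0.5))] += image[row][col]
    return p


def sinogram(image, n_angles):
    """All projections: one row per angle, evenly spaced over 180 degrees."""
    return [project(image, k * math.pi / n_angles) for k in range(n_angles)]


f = [[0, 1, 0],
     [2, 5, 3],
     [0, 4, 0]]
for p in sinogram(f, 4):
    print(p)
```

At 0 degrees the projection reduces to the column sums of the image, and at 90 degrees to the row sums; every projection preserves the total of the image.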
Example: Backprojection
[Figure panels: Sinogram; Backprojected Image]
Example: Filtered Backprojection
[Figure panels: Filtered Sinogram; Reconstructed Image]
Reconstruction: Solve for m’s
[Figure: an X-ray emitter scans a 4×4 grid of unknown densities representing the “human body”:
m11 m12 m13 m14
m21 m22 m23 m24
m31 m32 m33 m34
m41 m42 m43 m44
Detector values: 16, 22, 11, 10 along one side of the grid and 22, 12, 10, 15 along the other.]
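One way to "solve for m's" in the 4×4 example is an iterative scheme that alternates forward projection (compute the ray-sums of the current estimate) and backprojection (spread the error back over the pixels on that ray). The sketch below uses the Kaczmarz/ART update, which is a standard technique and not necessarily the method used in this work. It assumes that 22, 12, 10, 15 are row sums and 16, 22, 11, 10 are column sums; the slide text does not state which is which.

```python
def art_reconstruct(row_sums, col_sums, sweeps=10):
    """Kaczmarz/ART sketch for a grid observed only through row and column sums."""
    n_rows, n_cols = len(row_sums), len(col_sums)
    m = [[0.0] * n_cols for _ in range(n_rows)]
    for _ in range(sweeps):
        # Horizontal rays: forward-project each row, backproject the error.
        for i, target in enumerate(row_sums):
            corr = (target - sum(m[i])) / n_cols
            for j in range(n_cols):
                m[i][j] += corr
        # Vertical rays: same update for each column.
        for j, target in enumerate(col_sums):
            corr = (target - sum(m[i][j] for i in range(n_rows))) / n_rows
            for i in range(n_rows):
                m[i][j] += corr
    return m


ROW_SUMS = [22, 12, 10, 15]   # assumed orientation of the detector values
COL_SUMS = [16, 22, 11, 10]
m = art_reconstruct(ROW_SUMS, COL_SUMS)
for row in m:
    print([round(v, 2) for v in row])
```

The result reproduces all eight detector values, but it is only one of many grids that do: two directions give 8 measurements for 16 unknowns, which is why a real scanner needs rays at many more angles.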
Real Reconstruction Problem
• Intensity measured
• Rays transmitted through multiple “pixels”
• Find individual “pixel” values from transmission data
[Figure: a 512 × 512 grid of unknown values ("?") crossed by rays, with sample measured values 534, 417, 364, 555, 501, 355, 255, 712, 199. Annotation: 100s of diagonals at 100s of angles.]
Medical Imaging Applications
• Image reconstruction for MRI/CT/PET scans
• Large amounts of Vector/Thread-level parallelism
• FP-intensive kernels
– Often requiring math library functions
• Data-intensive (~5:1 compute:mem ratio)

Benchmark             Inner-loop %Scalar/Vector            Outer-loop TLP   Compute:Mem ratio
Segmentation          Fully vectorizable                   Do-all           4:1
Laplacian Filtering   Fully vectorizable                   Do-all           3:1
Gaussian Convolution  Fully vectorizable with predicates   Do-all           6:1
MRI FH Vector         Fully vectorizable                   Do-all           6:1
MRI Q Vector          Fully vectorizable                   Do-all           5.5:1
University of MichiganElectrical Engineering and Computer Science14
• Currently, most scans requiremoving patient to imaging room– Consumes time– Stress on patient
• Studies show benefits of portable, bed-side scanners:– 86% increase in patients suitable for post-stroke thrombolytic
therapy [Weinreb et al, RSNA]– 80-100% drop in scan-related complications
[Gunnarsson et al, J. of Neurosurgery]• New X-Ray emitters push for mAs of current use
Current Concerns: Portability/Power
Current Concerns: Performance
• High-accuracy CT algorithms take too long
– Iterative forward/backward projection
– ~Hours on modern CT scanners instead of minutes
• Interventional radiology
– Scans currently take minutes, but should take seconds
• CT-Fluoroscopy
– Several scans done in succession
Flexibility
• Software algorithms change over time
• NRE
• Time-to-market
PUMA
• Tiled architecture
• Bandwidth-matched for improved efficiency
• Each tile is a “Programmable Loop Accelerator”
[Figure: array of tiles attached to an external interface: CPU, Mem, Disk, …]
Programmable Loop Accelerator
• Generalize accelerator without losing efficiency
[Figure: the same design space, plotted as Flexibility versus Efficiency, Performance. Labeled points: General Purpose Processors; DSPs; Domain-specific Accelerators, GPGPUs; FPGAs; Loop Accelerators, ASICs. The region previously marked "???" is labeled Programmable Loop Accelerators.]
Designing Loop Accelerators
[Figure: a C code loop is mapped to hardware. The hardware consists of function units (+, &, <<, *, MEM), local memories, BR and CRF, linked by point-to-point connections.]
Loop Accelerator Architecture
Hardware realization of modulo scheduled loop
Parameterized hardware:
• FUs
• Shift Register Files
• Static Control
• Point-to-point Interconnect
[Figure: function units (+, &, MEM) with local memory, linked by point-to-point connections; an FSM issues the control signals; CRF and BR blocks.]
Programmable Loop-Accelerator Architecture

                LA                       PLA
Functionality   Custom FU set            Generalized FUs + MOVs
Connectivity    Point-to-point           Ring + Port-swapping
Storage         Limited size, no addr.   Rotating Reg. Files
Control         Hardwired Control        Lit. Reg. File + Control Mem

[Figure: the loop accelerator datapath (+, &, SRFs, FSM) generalized into function units (+/-, &/|, MEM) with local memory, rotating registers (RR), literals, a ring interconnect alongside the point-to-point connections, a control memory issuing the control signals, CRF and BR.]
MRI.FH PLA
• ~0.6 mm² per tile
• 38 FUs
• 128 32-bit registers
• Inter-FU BW 1 TB/sec

FU Type      #
FP-ADDSUB    6
FP-MPY       9
I-ADDSUB     8
MEM          9
I-MPY        1
Other        5
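Because the accelerator is a hardware realization of a modulo-scheduled loop, the FU mix bounds the initiation interval (II) that any other loop can reach on the same tile. The sketch below computes the standard resource-constrained lower bound (ResMII) from the FU counts in the table. The per-loop operation counts are hypothetical, and the real II also depends on recurrences and on the scheduler, so this is an illustration rather than the authors' tool flow.

```python
import math

# FU counts of the MRI.FH PLA, taken from the table above.
FU_COUNTS = {"FP-ADDSUB": 6, "FP-MPY": 9, "I-ADDSUB": 8,
             "MEM": 9, "I-MPY": 1, "Other": 5}


def res_mii(op_counts, fu_counts):
    """Resource-constrained minimum II: max over FU types of ceil(ops / FUs).

    Returns None when the loop needs an FU type the tile does not have.
    """
    ii = 1
    for fu_type, n_ops in op_counts.items():
        if n_ops == 0:
            continue
        available = fu_counts.get(fu_type, 0)
        if available == 0:
            return None
        ii = max(ii, math.ceil(n_ops / available))
    return ii


# Hypothetical loops, for illustration only.
matched = dict(FU_COUNTS)                               # one op per FU
heavy = {"FP-ADDSUB": 12, "FP-MPY": 9, "MEM": 9}        # twice the FP adds
unsupported = {"FP-DIV": 1, "MEM": 2}                   # no such FU on the tile

print(res_mii(matched, FU_COUNTS))      # 1
print(res_mii(heavy, FU_COUNTS))        # 2
print(res_mii(unsupported, FU_COUNTS))  # None
```

Doubling the II halves the loop's throughput, since a new iteration then starts only every second cycle.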
Performance on MRI.FH PLA
[Figure: bar chart of normalized performance (0.0 to 1.0) for MRI.FH, MRI.Q, CT.segment, CT.laplace, CT.gauss. Series: Non-Generalized, Generalized. Annotations: II preserved, II doubled, Unschedulable.]
Efficiency on MRI.FH PLA
[Figure: bar chart of normalized Perf/Power efficiency (0.0 to 1.0) for MRI.FH, MRI.Q, CT.segment, CT.laplace, CT.gauss and their mean. Series: Non-Generalized, Generalized.]
PUMA System Design
• 5 systems designed around 5 benchmarks
• Each composed of identical tiles
• Assume same B/W as GTX280 (142 GB/s)
• # Tiles based on B/W requirements of benchmark
[Figure: array of tiles attached to an external interface: CPU, Mem, Disk, …]
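The tile-count rule on this slide can be read as: instantiate as many identical tiles as the external bandwidth can feed. The sketch below follows that reading. The slides do not give the exact sizing formula or any per-tile bandwidth figures, so the function and the 4 GB/s per-tile demand are assumptions for illustration only.

```python
SYSTEM_BW_GBPS = 142.0   # external bandwidth assumed on the slide (GTX280)


def tiles_for_bandwidth(system_bw_gbps, tile_bw_gbps):
    """Largest number of identical tiles whose combined demand fits the budget."""
    if tile_bw_gbps <= 0:
        raise ValueError("per-tile bandwidth demand must be positive")
    return int(system_bw_gbps // tile_bw_gbps)


# Hypothetical per-tile demand for some benchmark, in GB/s.
print(tiles_for_bandwidth(SYSTEM_BW_GBPS, 4.0))   # 35
```

A benchmark with a higher compute:mem ratio needs less bandwidth per tile, so under this rule it would get more tiles for the same 142 GB/s.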
System Performance
[Figure: bar chart of GOPs/sec (0 to 160), Theoretical vs. Realized, for MRI.FH, MRI.Q, CT.segment, CT.laplace, CT.gauss. Power labels, in the order shown: 4 W, 3 W, 2.8 W, 2.3 W, 2.7 W.]
Performance vs. GPGPU
[Figure: bar chart of TOPs/sec (0.0 to 2.0), Theoretical vs. Realized, for PUMA, GTS 250, GTX 260, GTX 280, GTX 285, GTX 295. Annotations: 63% performance of GTX 295; 2X performance of GTS 250.]
Efficiency vs. GPGPU
[Figure: bar chart of PUMA Perf/Power efficiency over GPU (0 to 60) for MRI.FH, MRI.Q, CT.segment, CT.laplace, CT.gauss. Series: GTS 250, GTX 260, GTX 280, GTX 285, GTX 295. Annotations: 22X, 54X.]
Conclusions
• Power-efficient accelerator for medical imaging
• ASIC-like efficiency with programmability
• 63-201% of GPU performance
• 22-54X GPU Performance/Power efficiency
Thank you!!
Questions?