Member of the Helmholtz Association
Computation of Mutual Information Metric for Image Registration on Multiple GPUs
Andrew V. Adinetz1, Markus Axer2, Marcel Huysegoms2, Stefan Köhnen2, Jiri Kraus3, Dirk Pleiter1
26.03.2014
1 JSC, Forschungszentrum Jülich 2 INM-1, Forschungszentrum Jülich 3 NVIDIA GmbH
Presented at the HeteroPar'13 workshop of Euro-Par'13
• Brain Image Registration • Multi-GPU Implementation
• system memory • listupdate
• Performance Evaluation • Conclusion
Outline
March 26, 2014 2 GPU Technology Conference 2014
Preparation of the brain
BigBrain – first high-resolution brain model at microscopical scale
• 7404 histological sections stained for cell bodies
• scanned with a flatbed scanner
• original resolution 10 × 10 × 20 μm³ (11,000 × 13,000 pixels)
• downscaling to 20 μm isotropic
• removal of artifacts
• 1 Terabyte
in cooperation with Alan Evans, McGill, Montreal
Amunts et al. (2013) Science
Pushing the limits for a cellular brain model
• Registration = process of image alignment
Image Registration
ITK Workflow
• i, j – pixel values (0 .. 255)
• successful for multi-modal registration
Mutual Information Metric
MI(I_f, I_m) = \sum_{i,j} p(i,j) \log_2 \frac{p(i,j)}{p_f(i)\, p_m(j)}

p_f(i) = \sum_j p(i,j), \qquad p_m(j) = \sum_i p(i,j)
• main computational kernel • transform can be complex (1000+ parameters) • GPU implementation: 1 pixel/thread, atomics
Two Image Cross-Histogram
for (int y = 0; y < fixed_sz_y; y++)
    for (int x = 0; x < fixed_sz_x; x++) {
        int i = bin(fixed[y][x]);
        float x1 = transform_x(x, y);
        float y1 = transform_y(x, y);
        int j = bin(interpolate(moving, x1, y1));
        histogram[i][j]++;  // atomic on GPU
    }
Large Data Size
Large-area Polarimeter:
• size: 3,000 × 3,000 px
• pixel size: 60 × 60 µm
• file size: 30 MB

Polarizing Microscope:
• size: 100,000 × 100,000 px
• pixel size: 1.6 × 1.6 µm
• file size: 40 GB
Need multiple GPUs!
• Domain decomposition • distribute fixed and moving images • histogram contributions summed up
• Moving image: how to handle? • irregular access pattern
• Approaches • System memory replication (sysmem) • Listupdate (listupdate)
Multi-GPU Mutual Information
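The "histogram contributions summed up" step of the domain decomposition can be sketched on the host: each GPU fills a partial cross-histogram for its stripe of the fixed image, and the partials are summed into one global histogram before the metric is evaluated. A minimal serial stand-in with illustrative names (in the real code this would be a device-to-host copy followed by a reduction, e.g. via MPI):

```c
#define NBINS 256
#define NGPUS 4

/* Sum per-device partial cross-histograms into one global histogram.
 * partial[g] holds the contributions from device g's stripe of the
 * fixed image; each bin of the global histogram is the sum over devices. */
void reduce_histograms(unsigned partial[NGPUS][NBINS * NBINS],
                       unsigned global[NBINS * NBINS])
{
    for (int b = 0; b < NBINS * NBINS; b++) {
        unsigned sum = 0;
        for (int g = 0; g < NGPUS; g++)
            sum += partial[g][b];
        global[b] = sum;
    }
}
```

Because histogram binning is additive, this reduction is exact: splitting the fixed image across devices changes only where each pixel's contribution is counted, not the final counts.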
• Replicate entire moving image in pinned host RAM • accessible to GPU
+ easy to implement
– system memory accesses are slower – cannot use texture interpolation
• Optimizations • moving image halo in GPU RAM
System Memory Replication
• On remote access: "send message"
• On receiving a message: compute contributions
• Active messaging variant
  • buffering
  • relies on undocumented features
• Listupdate
  • chunking
  • buffer size bounded
  • communication-computation overlap
Listupdate

typedef struct {
    float movingCoords[2];
    short destRank;
    char fixedBin;
} message_t;
Writeout: Atomics vs Grouping
Atomics
Grouping
Atomics: determine write position using atomics
Grouping: write to a per-pixel buffer, then group (compress) with a warp-aggregated increment
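The atomics variant can be sketched serially: every pixel that needs a remote access reserves a slot in the message buffer by incrementing a shared counter, which on the GPU would be an `atomicAdd` (warp aggregation then batches these increments so each warp issues one atomic instead of up to 32). A CPU stand-in with illustrative names:

```c
typedef struct {
    float moving_x, moving_y; /* transformed coordinates in the moving image */
    char  fixed_bin;          /* bin of the fixed-image pixel */
} message_t;

/* Compact remote-access requests into a dense message buffer.
 * remote[p] flags pixels whose transformed coordinates fall on another
 * GPU's part of the moving image. *counter plays the role of the shared
 * counter that atomicAdd would bump on the device. Returns the number
 * of messages written. */
int compact_messages(int npixels, const int remote[],
                     const float mx[], const float my[], const char bin[],
                     message_t out[], int *counter)
{
    for (int p = 0; p < npixels; p++) {
        if (!remote[p])
            continue;
        int slot = (*counter)++;  /* atomicAdd(counter, 1) on the GPU */
        out[slot].moving_x  = mx[p];
        out[slot].moving_y  = my[p];
        out[slot].fixed_bin = bin[p];
    }
    return *counter;
}
```

The serial increment preserves pixel order; the GPU version does not, but order is irrelevant here because the receiver only accumulates histogram contributions.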
Chunk Processing and Overlap
[Diagram: per-chunk pipeline of process chunk → group → exchange → handle messages over the fixed image; the stages of consecutive chunks (1, 2) overlap, so computation hides communication]
+ computation-communication overlap
– hard to implement
– requires chunk processing (otherwise messages won't fit into the buffer)
• Optimizations • buffers: AoS vs. SoA • atomics vs. grouping • using multiple streams
Listupdate

typedef struct {
    float movingCoords[2];
    short destRank;
    char fixedBin;
} message_t;
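The AoS-vs-SoA choice from the optimization list can be illustrated on the host: the same messages stored as one array of structs versus parallel per-field arrays. In the SoA layout, thread t writes index t of each array, so neighboring GPU threads write contiguous memory and the stores coalesce. A sketch with illustrative names:

```c
/* AoS: one struct per message (the layout shown on the slide) */
typedef struct {
    float movingCoords[2];
    short destRank;
    char  fixedBin;
} message_aos_t;

/* SoA: one array per field; same-field writes from neighboring
 * threads land in contiguous memory on the GPU. */
typedef struct {
    float *coords_x;
    float *coords_y;
    short *dest_rank;
    char  *fixed_bin;
} message_soa_t;

/* Convert an AoS buffer of n messages into SoA layout. */
void aos_to_soa(int n, const message_aos_t *in, message_soa_t *out)
{
    for (int t = 0; t < n; t++) {
        out->coords_x[t]  = in[t].movingCoords[0];
        out->coords_y[t]  = in[t].movingCoords[1];
        out->dest_rank[t] = in[t].destRank;
        out->fixed_bin[t] = in[t].fixedBin;
    }
}
```

In the kernel itself one would write straight into the SoA arrays rather than converting; the conversion above just makes the two layouts concrete side by side.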
Benchmark setup
[Diagram: benchmark setup — two fixed-image domains with x/y axes and origin (0,0), the remote-access region, and the mask]
• JUDGE • 256-node GPU cluster • Each M2070 node:
• 2x M2070 (Fermi) GPU, each 6 GB RAM • 12-core X5650 CPU @ 2.67 GHz, 96 GB RAM
• JuHydra • single-node Kepler machine
• 2x K20X (Kepler) GPU, each 6 GB RAM • 16-core E5-2650 CPU @ 2 GHz, 64 GB RAM
Test Hardware
Baseline: Full Replication (M2070)
[Chart: runtime in seconds vs. rotation angle (0° to 180°) for 1, 2, and 4 GPUs]

ideal scalability
Sysmem on Fermi
[Chart: runtime in seconds vs. rotation angle for 1 GPU, the 2-GPU baseline, and 2 GPUs with sysmem]
Sysmem on Fermi: Explanation
[Diagram: four regimes over the rotation angle — no sysmem accesses with good coalescing; few sysmem accesses with bad coalescing; many sysmem accesses with bad coalescing; mostly sysmem accesses with good coalescing]
Sysmem on Fermi: PCI-E Queries
[Chart: runtime in seconds (left axis) and total number of sysmem queries (right axis) vs. rotation angle, for the 2-GPU baseline and 2 GPUs with sysmem]
Sysmem: Halo Sizes
[Chart: time (s) vs. angle (0° to 36°) on 2 K20X for baseline, sysmem, and halo sizes of 5%, 10%, 15%, 20%, and 25%]

mostly quantitative, not qualitative difference
Listupdate: Multiple Streams
4 streams look the best
[Chart: time (s) vs. angle on 2 K20X with 1, 2, 3, and 4 streams]
Listupdate: AoS vs SoA, Atomics vs Group
SoA + atomics looks best
[Chart: time (s) vs. angle on 2 K20X for the SoA, AoS, and compress variants]
typedef struct {
    float movingCoords[2];
    char fixedBin;
} message_t;
Sysmem vs. Listupdate: Fermi
[Chart: time (s) vs. angle on 4 M2070 for SoA (listupdate), baseline, sysmem, and 25% halo]

on Fermi, sysmem is better
Sysmem vs. Listupdate: Kepler (Closeup)
[Chart: time (s) vs. angle (0° to 36°) on 2 K20X for SoA (listupdate), baseline, sysmem, and 25% halo]

on Kepler, listupdate is better
• Fermi • performance limited by atomics • system memory replication is better
• Kepler • 10x faster than Fermi • no longer dominated by atomics • listupdate (atomic, SoA, 4 streams) is better
• Future work • Compression • Trials on real images
Conclusions
March 26, 2014 29 GPU Technology Conference 2014
• INM-1 at FZJ: http://www.fz-juelich.de/inm/inm-1/EN/Home/home_node.html
• NVidia Application Lab at FZJ: http://www.fz-juelich.de/ias/jsc/nvlab • Andrew V. Adinetz: [email protected] • Jiri Kraus: [email protected] • Dirk Pleiter: [email protected]
Questions?