Where Tegra meets Titan - NVIDIA (on-demand.gputechconf.com/gtc/2016/presentation/s...)

Where Tegra meets Titan

Prof Tom Drummond

Computer vision is easy!
But first a diversion to 10th Century Persia …

… and the first recorded game of chess

The rice and the chessboard






First half of the chessboard: 100 tons of rice


Second half of the chessboard: 400 billion tons of rice = 1000 years of production
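The jump from 100 tons to roughly 400 billion tons follows from the doubling rule alone. A minimal sketch, assuming the usual telling of the story (one grain on the first square, doubling on each square after it); the function names are illustrative and not from the slides:

```cpp
#include <cassert>
#include <cstdint>

// Grains on squares first..last (1-based) when square k holds 2^(k-1) grains.
// The full board sums to 2^64 - 1, which still fits in 64 bits.
std::uint64_t grains_on_squares(int first, int last) {
    assert(1 <= first && first <= last && last <= 64);
    std::uint64_t total = 0;
    for (int k = first; k <= last; ++k) {
        total += std::uint64_t{1} << (k - 1);
    }
    return total;
}

// How many times more rice the second half of the board holds than the first.
double second_to_first_half_ratio() {
    return static_cast<double>(grains_on_squares(33, 64)) /
           static_cast<double>(grains_on_squares(1, 32));
}
```

The ratio is exactly 2^32, about 4.3 billion, so 100 tons on the first half scales to about 430 billion tons on the second half, the same order as the figure on the slide.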

And the moral of the story is …

The transistor and the chessboard

1974: Intel 8080 (6,000 transistors)
1978: Intel 8086 (29,000 transistors)
1982: Intel 80286 (134,000 transistors)
1993: Intel Pentium (3,000,000 transistors)
2004: Intel P4 Prescott (125,000,000 transistors)

?

How many on the last square…?

This notebook: > 2 trillion transistors

2004: Nvidia NV40 (222,000,000 transistors)
2006: Nvidia G80 (484,000,000 transistors)
2008: Nvidia GT200 (1,400,000,000 transistors)
2010: Nvidia GF104 (1,900,000,000 transistors)
2012: Nvidia GK104 (3,540,000,000 transistors)
2015: Nvidia GM200 (8,000,000,000 transistors)

Can run Moore's law backwards
Q: According to Moore's law, when was there just one transistor?
A: 1948

In Nov 1947, Bardeen, Brattain and Shockley attached two gold contacts to a crystal of germanium…
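The backwards extrapolation can be reproduced with one logarithm. A minimal sketch, assuming a doubling period of two years and using the 2015 GM200 count as the anchor; the slides do not state which anchor or period was used, so both are assumptions here and the function name is illustrative:

```cpp
#include <cassert>
#include <cmath>

// Year at which an exponential trend through (anchor_year, anchor_count),
// doubling every doubling_period_years, would pass through one transistor.
double year_of_one_transistor(double anchor_year, double anchor_count,
                              double doubling_period_years) {
    assert(anchor_count > 0.0 && doubling_period_years > 0.0);
    double doublings = std::log2(anchor_count);  // doublings needed to reach the anchor
    return anchor_year - doublings * doubling_period_years;
}
```

With year_of_one_transistor(2015.0, 8.0e9, 2.0) the result is roughly 1949, within a year or so of the answer on the slide; a different anchor chip or doubling period shifts it slightly.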

Power

Moore's law gives us increasing compute power

BUT

With great power comes great …

Moore's Law is not always our friend!

Even with GPUs, compute on mobile devices is limited
Can't put a K40 on a quadrotor

But a TX1 fits just fine (Stereolabs TX1-enabled drone)

ACRV

The Australian Research Council Centre of Excellence for Robotic Vision
•  $25.5M over 7 years
•  13 Chief Investigators in 4 Universities
•  16 Research Fellows
•  ~50 PhD students
•  Research into:
   –  Semantics (deep learning)
   –  Robust vision (all weathers)
   –  Vision and Action (closing the loop)
   –  Algorithms and Architecture (constrained resources)

Distributed Robotic Vision

Simplest method is to just partition the problem somewhere, giving some tasks to the mobile and some to the server

[Diagram: mobile | server]

But often this isn't the best solution,
e.g. latency introduced by the network may be a problem

Many interesting solutions are not like this, e.g.:

[Flowchart, box labels: Obtain sensor data; Extract summary information; Compute accurate solution; Compute approximate solution; Compare; Calculate output; Update local model; Bring correction up to date; Calculate and send correction; Compute approximate solution]
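One way to read the flowchart labels is that the mobile keeps producing approximate solutions at frame rate, while the server's accurate solution arrives late and refers to an old frame, so the correction has to be brought up to date before use. A minimal sketch of that idea for a 1-D state; the class, its methods and the split of work between mobile and server are assumptions for illustration, not the design from the talk:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical mobile-side tracker. It stores every approximate estimate so a
// late server reply can be compared with what was believed at that old frame.
class LocalTracker {
public:
    explicit LocalTracker(double initial) : history_{initial} {}

    // Fast path on the mobile: integrate an approximate local motion estimate.
    void step(double approximate_delta) {
        history_.push_back(history_.back() + approximate_delta);
    }

    // Slow path: the server reports an accurate value for an earlier frame.
    // The error seen at that frame is applied to it and to every later
    // estimate, which brings the correction up to date.
    void apply_server_result(std::size_t frame, double accurate_value) {
        assert(frame < history_.size());
        double correction = accurate_value - history_[frame];
        for (std::size_t i = frame; i < history_.size(); ++i) {
            history_[i] += correction;
        }
    }

    double current() const { return history_.back(); }

private:
    std::vector<double> history_;
};

// Usage: three approximate steps of 1.1, then a late reply saying that
// frame 1 was really at 1.0. The current estimate moves from 3.3 to about 3.2.
double example_late_correction() {
    LocalTracker tracker(0.0);
    tracker.step(1.1);
    tracker.step(1.1);
    tracker.step(1.1);
    tracker.apply_server_result(1, 1.0);
    return tracker.current();
}
```

The mobile never waits on the network here: output is always available from the approximate path, and the server only nudges it when a reply arrives.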


Want to create solutions to enable robotics in a distributed sensing and compute environment

[Diagram labels: TX1 ×3, K40 ×8, CPU ×2]

Distributed Localisation Service

[Pipeline diagram, box labels.
CCTV1 (Extract landmarks): Build Image Pyramid; Build Descriptors; Index; Match.
CCTV2 (Extract landmarks): Build Image Pyramid; Build Descriptors; Index; Match.
Robot (Extract landmarks): Build Image Pyramid; Build Descriptors.
Compute 1: Compute Robot pose.]
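The Match step is not detailed on the slides. ORB descriptors are 256-bit binary strings that are normally compared by Hamming distance, so a brute-force matcher can be sketched as below. The type and function names are illustrative, and the linear scan is a stand-in for whatever index the service actually uses:

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A 256-bit binary descriptor stored as four 64-bit words.
using Descriptor = std::array<std::uint64_t, 4>;

// Number of bit positions in which two descriptors differ.
int hamming_distance(const Descriptor& a, const Descriptor& b) {
    int distance = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        distance += static_cast<int>(std::bitset<64>(a[i] ^ b[i]).count());
    }
    return distance;
}

// Index of the reference descriptor closest to the query (first one on ties).
std::size_t nearest_descriptor(const Descriptor& query,
                               const std::vector<Descriptor>& references) {
    assert(!references.empty());
    std::size_t best = 0;
    int best_distance = hamming_distance(query, references[0]);
    for (std::size_t i = 1; i < references.size(); ++i) {
        int d = hamming_distance(query, references[i]);
        if (d < best_distance) {
            best_distance = d;
            best = i;
        }
    }
    return best;
}
```

A linear scan like this costs one pass over the database per query, which is the cost an index is meant to avoid.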

    ==3031== NVPROF is profiling process 3031, command: ./ComputeOrb 1
    Frame# 1
    Elapsed time : 5.955523 ms
    Frame Elapsed time : 7.765627 ms
    numCorners: 28304, nmsnumCorners: 5073
    ==3031== Profiling application: ./ComputeOrb 1
    ==3031== Profiling result:
    Time(%)      Time  Calls       Avg       Min       Max  Name
     57.18%  3.2379ms      1  3.2379ms  3.2379ms  3.2379ms  OrbDescriptors(…)
     30.57%  1.7312ms      1  1.7312ms  1.7312ms  1.7312ms  (…)
      4.29%  242.92us      1  242.92us  242.92us  242.92us  fastcorner(…)
      4.00%  226.31us      1  226.31us  226.31us  226.31us  harris(…)
      1.46%  82.553us      1  82.553us  82.553us  82.553us  NMS(…)
      0.73%  41.458us      1  41.458us  41.458us  41.458us  cleansweep(…)

Speedup over CPU* implementation is 4-5X

* Intel Core2 Quad Q8400 @ 2.66 GHz

Sub-pixel localisation

Timing Results (µs/keypoint):
    Inverse Additive        672
    Inverse Compositional   367
    Ours                      7

[Diagram, box labels.
Camera 1: Find landmarks; Extract image patch; Compute sub-pixel correspondence on many subsequent frames.
Camera 2: Find landmarks; Extract image patch; Compute sub-pixel correspondence on many subsequent frames.
Compute 1: Compute matrix.]

Approximate Nearest Neighbor

Big data in high dimensional spaces
Given a query point, find the nearest reference point
Solution: FANNG (Fast Approximate Nearest Neighbor Graphs) @CVPR 2016
Can serve 1.2M queries/second at 90% recall in a database of 1M reference points in 128D space on Titan X

CUDA implementation requires a short priority queue BUT
    int array[30]; // very slow global memory

Solution is to treat a warp as a single unit with array spread over the warp in a single register:
    int array; // there are 32 of these in a warp
    ...
    // find the first entry in array that is > thresh
    int pq = __ffs(__ballot(array > thresh));
    ...
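To see what the vote-then-find-first-set idiom computes, the two intrinsics can be emulated on the host. This is a sketch in plain C++, not device code: ballot_greater_than and find_first_set are illustrative stand-ins for the CUDA warp vote (`__ballot`, or `__ballot_sync` in current toolkits) and for `__ffs`, with a std::array playing the role of one register per lane:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;
using WarpRegister = std::array<int, kWarpSize>;  // one int per lane

// Stand-in for the warp vote: bit i is set when lane i holds a value > thresh.
std::uint32_t ballot_greater_than(const WarpRegister& array, int thresh) {
    std::uint32_t mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (array[lane] > thresh) {
            mask |= std::uint32_t{1} << lane;
        }
    }
    return mask;
}

// Stand-in for __ffs: 1-based position of the lowest set bit, 0 if none is set.
int find_first_set(std::uint32_t mask) {
    for (int bit = 0; bit < 32; ++bit) {
        if (mask & (std::uint32_t{1} << bit)) {
            return bit + 1;
        }
    }
    return 0;
}

// 1-based lane of the first entry greater than thresh, or 0 if there is none.
int first_entry_above(const WarpRegister& array, int thresh) {
    int position = find_first_set(ballot_greater_than(array, thresh));
    assert(position >= 0 && position <= kWarpSize);
    return position;
}
```

Note the 1-based result: with lanes holding 0, 2, 4, …, 62 and a threshold of 9, the first larger entry is 10 in lane 5, and the function returns 6. On the device the whole search is two instructions with no loop and no memory traffic.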

Want to keep the array sorted when we insert a new value, discarding the largest value

    thread:       0   1   2   3   4   5   6   7
    array:        1   2   4   5   9  11  13  15
    new_value:    8   8   8   8   8   8   8   8   (each thread sees this value)
    ship value:   8   8   8   8   9  11  13  15   = max(new_value, array)
    shuffle:      -   8   8   8   8   9  11  13
    Write new value if less than array
    array:        1   2   4   5   8   9  11  13
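The three steps in the diagram (compute the ship value, shuffle it up one lane, conditionally overwrite) can be checked on the host. A minimal sketch in plain C++ that emulates the lanes with a std::array; on the GPU the middle step would be a shuffle-up intrinsic such as `__shfl_up_sync`. The function name and the handling of lane 0, which has no left neighbour and is assumed to take new_value directly, are not from the slides:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Insert new_value into an ascending array held one element per lane,
// dropping the largest element. Each loop models all lanes acting at once.
template <std::size_t N>
void warp_sorted_insert(std::array<int, N>& array, int new_value) {
    assert(std::is_sorted(array.begin(), array.end()));

    // Step 1: every lane computes the value it ships to its right neighbour.
    std::array<int, N> shipped{};
    for (std::size_t lane = 0; lane < N; ++lane) {
        shipped[lane] = std::max(new_value, array[lane]);
    }

    // Step 2: shuffle up by one lane. Lane 0 has no left neighbour, so it
    // takes new_value itself (an assumption about the edge case).
    std::array<int, N> received{};
    for (std::size_t lane = 0; lane < N; ++lane) {
        received[lane] = (lane == 0) ? new_value : shipped[lane - 1];
    }

    // Step 3: lanes whose entry is larger than new_value take the shuffled value.
    for (std::size_t lane = 0; lane < N; ++lane) {
        if (new_value < array[lane]) {
            array[lane] = received[lane];
        }
    }
}
```

Running it on the example above, inserting 8 into 1 2 4 5 9 11 13 15 leaves 1 2 4 5 8 9 11 13. Lanes to the left of the insertion point are untouched, and the old largest value 15 is simply never shipped anywhere.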
