Overview
Videos are everywhere
But they can take up large amounts of resources: disk space, memory, network bandwidth
Exploit redundancy to reduce file size: spatial and temporal
General lossless compression
Huffman coding – shorter bit sequences for common data
Lempel–Ziv – short bit sequences for previously seen strings
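As a concrete illustration of the Huffman idea (a self-contained sketch, not code from any paper discussed here): repeatedly merging the two least frequent symbols builds a tree in which common symbols end up closer to the root, i.e. with shorter codes.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Sketch of Huffman coding: greedily merge the two least frequent
    symbols; each merge adds one bit to every symbol in the merged group."""
    heap = [(freq, i, {sym: 0})
            for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    counter = len(heap)  # tie-breaker so dicts are never compared
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**c1, **c2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

lengths = huffman_code_lengths("aaaaaaabbbccd")
print(lengths)   # 'a', the most frequent symbol, gets the shortest code
```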
Transform coding
Perform some transformation on the data
Does not reduce data size; usually theoretically lossless
Concentrates information in a small(er) number of data points
Quantize the data (lossy) – most data points become small numbers
Losslessly compress the data stream – the typical range of values is smaller, so fewer bits are required for the common case
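The transform/quantize/entropy-code pipeline can be illustrated with a toy example (hypothetical coefficient values, uniform quantizer): after the transform concentrates energy in a few coefficients, quantization shrinks the value range and rounds most small coefficients to zero, which is what makes the final lossless stage effective.

```python
def quantize(coeffs, step):
    """Lossy step: map each coefficient to the nearest multiple of `step`."""
    return [round(c / step) for c in coeffs]

def dequantize(q, step):
    """Inverse mapping used by the decoder; the rounding error is permanent."""
    return [v * step for v in q]

# Hypothetical transform output: energy concentrated in the first coefficients.
coeffs = [312.0, -47.0, 6.2, -1.4, 0.8, 0.3]
q = quantize(coeffs, 4.0)
print(q)                    # small integers, trailing zeros -> cheap to entropy-code
print(dequantize(q, 4.0))   # close to the original; high frequencies lost
```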
Discrete Cosine Transform (DCT)
Traditional lossy compression
Converts a function of time into a function of frequency – a weighted sum of cosine functions
Information from the original signal can be completely reconstructed from the generated weights
Fast FFT-style algorithms: O(N log N) vs. O(N^2) for the naive transform
2D DCT
Treat each row of the signal as a 1D signal and perform a 1D transform
Treat each column of the transformed signal as a 1D signal and perform another 1D transform
Separable transformation: 2nk operations vs. nk^2
3D extension?
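A minimal sketch of the separable 2D DCT described above, using the naive un-normalized DCT-II (O(N^2) per row/column; a real codec would use a fast, properly scaled variant):

```python
import math

def dct1d(x):
    """Naive un-normalized DCT-II: X[k] = sum_n x[n] * cos(pi*(2n+1)*k / (2N))."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N))
            for k in range(N)]

def dct2d(block):
    """Separable 2D DCT: 1D transform on each row, then on each column."""
    rows = [dct1d(row) for row in block]          # horizontal pass
    cols = [dct1d(col) for col in zip(*rows)]     # vertical pass on the transpose
    return [list(r) for r in zip(*cols)]          # transpose back

flat = [[8.0] * 4 for _ in range(4)]   # constant block: no spatial variation
out = dct2d(flat)
print(out[0][0])   # → 128.0; all energy lands in the DC coefficient
```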
Transform coding
The DCT itself does not perform any compression
Images concentrate most of their information in low-frequency components
High-frequency components can be stored with less precision – the human visual system is less sensitive to them
Often high-frequency components round to zero, and the loss of information is not noticeable

Global transform
The DCT acts on an entire signal
So perform it on image blocks instead
One value per frequency for an entire block
Block Artefacts
Image discontinuities: sharp edges dividing otherwise relatively low-frequency areas
High-frequency components localized to a small number of pixels
The DCT is less effective at representing these compactly
Discrete Wavelet Transform (DWT)
Decomposition into two signals, each with half the resolution of the input
Approximation signal: low-res version of the original; contains only low frequencies
Detail signal: information lost by reducing the resolution; contains only high frequencies
Discrete Wavelet Transform (DWT)
The approximation signal is recursively transformed
The image is entirely converted to detail signals of various resolutions
The final result is effectively a sum of scaled and translated versions of a wavelet (a small portion of a wave)
Wavelets have location, waves have phase
Avoids undershoot and ringing
The 2D DWT is often separable (though this depends on the wavelet) – square decomposition
The Haar Wavelet
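One Haar analysis level can be sketched with the average/half-difference convention (the orthonormal Haar wavelet scales both outputs by sqrt(2) instead):

```python
def haar_step(signal):
    """One Haar DWT level: pairwise averages form the approximation signal,
    pairwise half-differences form the detail signal; each is half the
    length of the input."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

a, d = haar_step([9, 7, 3, 5])
print(a)   # [8.0, 4.0] -- low-res version of the original
print(d)   # [1.0, -1.0] -- information lost by halving the resolution
```

Recursing on `a` (as the slides describe) produces the full multi-level decomposition.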
More complicated wavelets
Locality
The detail signal is not transformed further
Despite being high frequency, discontinuities remain localized
Can be less effective for periodic signals, but better for images
Motion compensation
Calculate the motion direction of parts of an image
Temporal coherence: similarity between neighboring video frames
Global – describes the motion of the camera
Local – describes the motion of small objects (within a block of the image)
Motion compensation => a next-frame prediction
The residue (difference from the prediction) is stored
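A toy 1D sketch of this idea (hypothetical sample values; real codecs search 2D blocks with sub-pixel precision): pick the shift of the previous frame that minimizes the residue, then store only the motion vector plus the residue.

```python
def predict_and_residue(prev, curr, dx):
    """Predict `curr` by circularly shifting `prev` right by `dx` samples,
    then compute the residue (difference from the prediction)."""
    prediction = prev[-dx:] + prev[:-dx] if dx else prev[:]
    residue = [c - p for c, p in zip(curr, prediction)]
    return prediction, residue

prev = [0, 0, 9, 9, 0, 0]
curr = [0, 0, 0, 9, 9, 0]   # the "object" moved one sample to the right

# Exhaustive search for the motion vector minimizing the residue energy.
best = min(range(len(prev)),
           key=lambda dx: sum(abs(r) for r in predict_and_residue(prev, curr, dx)[1]))
print(best)                                       # → 1
print(predict_and_residue(prev, curr, best)[1])   # → all-zero residue: cheap to store
```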
Accelerating Wavelet-Based Video Coding on Graphics Hardware using CUDA
Wladimir J. van der Laan, Jos B.T.M. Roerdink, Andrei C. Jalba
Dirac Wavelet Video Codec (DWVC)
Video compression format
Open-source, royalty-free alternative to H.264, with roughly equivalent quality
Developed by BBC Research
Dirac-research – the reference implementation
Schrödinger – high-performance, heavily optimized implementation; a good basis for performance comparison
DWVC Decoding
Stream data:
Intra-frames – self-contained images
Inter-frames – difference with respect to one or two reference frames
Arithmetic decoding – lossless; extracts parameters, vectors, and coefficients from the bitstream
Reverses the entropy coder, which represents common values with shorter bit sequences
Little inherent parallelism – handled by the CPU
Motion compensation – the residue (difference from the prediction) is stored as wavelet coefficients
CUDA Implementation
Use CUDA to avoid mapping the decoding process onto the rendering pipeline
Lifting scheme – less arithmetic, in-place
Frame arithmetic – 16 vs. 32 bit?
Sub-pixel precision: bicubic interpolation of the reference frame
Separable transformation for wavelet lifting: decompose the 2D operation into two 1D operations
Horizontal Pass
Coalesced read of part of a row
Duplicate border elements – boundary conditions
Shared memory: in-place lifting
__syncthreads() after each step of the transform
Coalesced write back to global memory
Reorganized coefficients – based on JPEG 2000 cache-efficient wavelet lifting
Vertical Pass
Substituting rows for columns -> poor coalescing
Each block processes multiple columns: a slab
Each row in a slab can be read with coalescing
Shared memory: transform on columns
Sliding window – not all columns can fit in shared memory
Motion compensation: block placement
Traditional: divide the image into equally sized, disjoint blocks
Strong discontinuities between neighboring blocks; poor prediction at block edges
Overlapped Block Motion Compensation: overlaps neighboring blocks, blending them together in the shared area

Reference frame options:
The previous frame
The previous and next frames (blended together with some weights) – for fades
A different frame several frames back – if it is a better match
Overlapped blocks
Each pixel is part of up to four motion compensation blocks per frame
Naïve implementation: equally sized CUDA blocks
Complicated flow control – neighboring pixels access different motion compensation blocks
Solution: divide the image into regions based on the number and orientation of overlapping blocks
Center – 1 block
Edges – 2 blocks (horizontal or vertical overlap), linear blend
Corners – 4 blocks, bilinear blend
All pixels in a region run the same code; each region is processed by one CUDA block, so there is no divergent branching within a block
Texture memory is faster than constant memory here – each thread potentially accesses a different location
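The corner case above can be sketched as a bilinear blend of four block predictions (the weights and prediction values here are hypothetical; Dirac's actual OBMC weight matrices differ):

```python
def blend_predictions(preds, weights):
    """OBMC blending: a pixel's final prediction is a weighted sum of the
    predictions from every motion-compensation block covering it.
    The per-pixel weights must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(p * w for p, w in zip(preds, weights))

# A corner pixel covered by 4 overlapping blocks; wx/wy are the linear
# ramp weights toward "its own" block along each axis (hypothetical values).
wx, wy = 0.75, 0.75
weights = [wx * wy, (1 - wx) * wy, wx * (1 - wy), (1 - wx) * (1 - wy)]
print(blend_predictions([100, 120, 90, 110], weights))   # → 102.5
```

Edge pixels are the same computation with only two blocks and a single linear ramp, and center pixels degenerate to one block with weight 1.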
Results
Dual-core AMD Opteron 280 vs. Nvidia GeForce GTX 280, CUDA 2.2
CPU reference is single-threaded; GPU times do not include readback (video is displayed through OpenGL textures)
5.4x overall speedup for the entire decode process
13x speedup for GPU operations (arithmetic decoding excluded)
1920x1080 (1080p) displayed at 56.4 fps, vs. 10.5 fps for the CPU reference
25 fps is needed for movie playback
Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting
Christian Tenllado, Javier Setoain, Manuel Prieto, Luis Piñuel, Francisco Tirado
Focus on the DWT
Has other image processing / computer graphics applications – multiresolution analysis
Primary methods: filter bank, lifting scheme
Filter bank
Given a signal A:
Run a low-pass filter (convolution) on A to get the low-frequency approximation (~blur)
Run the corresponding high-pass filter on A to get the high-frequency details
Downsample both by 2 (since we now have twice as much information as necessary)
Recurse on the approximation
A direct translation of the definition of the wavelet transform
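These steps can be sketched as follows, using un-normalized Haar filters for brevity (circular boundary handling here; a real codec would use symmetric extension):

```python
def fb_dwt_step(x, lo, hi):
    """One filter-bank DWT level: convolve with the low- and high-pass
    filters, then keep every other output sample (downsample by 2)."""
    N = len(x)
    def conv(taps):
        # circular convolution: y[n] = sum_k taps[k] * x[n - k]
        return [sum(taps[k] * x[(n - k) % N] for k in range(len(taps)))
                for n in range(N)]
    # keep odd-indexed outputs so each kept sample covers the pair (x[2i], x[2i+1])
    approx = conv(lo)[1::2]
    detail = conv(hi)[1::2]
    return approx, detail

# Haar analysis filters: averaging (low-pass) and half-differencing (high-pass)
a, d = fb_dwt_step([9, 7, 3, 5], lo=[0.5, 0.5], hi=[-0.5, 0.5])
print(a, d)   # → [8.0, 4.0] [1.0, -1.0]
```

Recursing on `a` with the same filters gives the full decomposition.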
Lifting scheme
Combines the high-pass and low-pass filters
Any FBS wavelet can be factorized into several lifting steps using the polyphase matrix representation
Split the signal into odd/even samples (the lazy wavelet transform)
Predict the odd samples from the even ones
Update the even samples to preserve properties of the signal (e.g. its mean)
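The split/predict/update steps can be sketched for a Haar-style lifting factorization (here detail = odd − even and approximation = pairwise mean; conventions and scaling vary between codecs). Inversion just runs the same steps backwards with the signs flipped:

```python
def haar_lifting(x):
    """Haar DWT via lifting: no convolution, in-place friendly."""
    even, odd = x[0::2], x[1::2]                       # split (lazy wavelet transform)
    detail = [o - e for e, o in zip(even, odd)]        # predict: odd ≈ even
    approx = [e + d / 2 for e, d in zip(even, detail)] # update: preserve the mean
    return approx, detail

def haar_unlifting(approx, detail):
    """Inverse: undo update, undo predict, merge."""
    even = [a - d / 2 for a, d in zip(approx, detail)]
    odd = [e + d for e, d in zip(even, detail)]
    return [v for pair in zip(even, odd) for v in pair]

a, d = haar_lifting([9, 7, 3, 5])
print(a, d)                  # → [8.0, 4.0] [-2, 2]
print(haar_unlifting(a, d))  # → [9.0, 7.0, 3.0, 5.0]: perfect reconstruction
```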
LS Advantages
Simple to invert: run the steps in the opposite direction (no deconvolution required)
A method for producing wavelet transforms with control over the actual operations executed
Can use integer operations -> lossless compression
Easy to generalize: lifting steps must be invertible, but do not have to be linear
Tends to be more efficient w.r.t. hardware or power consumption for embedded systems
FBS vs. LS: Speed
CPU: LS is up to twice the speed of FBS
Performs about half as many computations, though actual gains are often smaller than theoretical
In-place transform
LS is the default way to implement the wavelet transform – seen as the most efficient
GPU: FBS is actually faster – fewer synchronization barriers
Implementation
OpenGL + Cg
Layout: 2x2 blocks stored in an RGBA texel – allows the horizontal and vertical algorithms to be designed symmetrically
Filter bank – a synchronization barrier between the horizontal and vertical filters
Lifting scheme: several loops performing simple vector operations on each data stream
Every LS step is performed by a different kernel – many synchronization barriers
Results
Execution times scale linearly with problem size
The ratio of LS time to FBS time -> constant as size grows
Speedups from the Nvidia FX 5950 Ultra (2003) to the 7800 GTX (2005): 4x for FBS, 2.2x for LS
Results
The key performance factor is the number of rendering passes and synchronization barriers
FBS doesn't require a pipeline flush, allowing better parallelization
LS: removing synchronization barriers (incorrect output, but a good performance estimate) gives a 1.4x speedup
GPU: 1.2–3.4x speedup over the CPU implementation, excluding data transfer
Transforms a 4-megapixel image in 9.12 ms (FBS) and 17.9 ms (LS) with Daubechies-4; slower times for more complicated wavelets
Future improvements
The LS/FBS time ratio grows as the number of shader processors increases – future GPUs will progressively favor FBS
Waiting for better CPU/GPU integration
Suggest fusing consecutive kernels – increased complexity, but faster
Summary
The GPU allows a several-times speedup over the CPU for decompression with modern codecs
May not seem dramatic, but it helps cross the barrier of movie frame rates
Allows more types of compression algorithms to become feasible
Implementation methods that are best for the CPU may not be best for the GPU