Overview
Videos are everywhere
But they can take up large amounts of resources: disk space, memory, network bandwidth
Exploit redundancy to reduce file size: spatial and temporal
General lossless compression
Huffman coding – shorter bit sequences for common data
Lempel–Ziv – short bit sequences for previously seen strings
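As a concrete illustration of the Huffman idea (a self-contained sketch, not code from any paper discussed here): repeatedly merging the two least frequent symbols builds a tree in which common symbols end up closer to the root, i.e. with shorter codes.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Sketch of Huffman coding: greedily merge the two least frequent
    symbols; each merge adds one bit to every symbol in the merged group."""
    heap = [(freq, i, {sym: 0})
            for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    counter = len(heap)  # tie-breaker so dicts are never compared
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**c1, **c2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

lengths = huffman_code_lengths("aaaaaaabbbccd")
print(lengths)   # 'a', the most frequent symbol, gets the shortest code
```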
Transform coding
Perform some transformation on the data
Does not reduce data size; usually theoretically lossless
Concentrates information in a small(er) number of data points
Quantize the data (lossy) – most data points become small numbers
Losslessly compress the data stream – the typical range of values is smaller, so fewer bits are required for the common case
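The transform/quantize/entropy-code pipeline can be illustrated with a toy example (hypothetical coefficient values, uniform quantizer): after the transform concentrates energy in a few coefficients, quantization shrinks the value range and rounds most small coefficients to zero, which is what makes the final lossless stage effective.

```python
def quantize(coeffs, step):
    """Lossy step: map each coefficient to the nearest multiple of `step`."""
    return [round(c / step) for c in coeffs]

def dequantize(q, step):
    """Inverse mapping used by the decoder; the rounding error is permanent."""
    return [v * step for v in q]

# Hypothetical transform output: energy concentrated in the first coefficients.
coeffs = [312.0, -47.0, 6.2, -1.4, 0.8, 0.3]
q = quantize(coeffs, 4.0)
print(q)                    # small integers, trailing zeros -> cheap to entropy-code
print(dequantize(q, 4.0))   # close to the original; high frequencies lost
```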
Discrete Cosine Transform (DCT)
Traditional lossy compression
Converts a function of time into a function of frequency – a weighted sum of cosine functions
Information from the original signal can be completely reconstructed from the generated weights
Fast FFT-style algorithms: O(N log N) vs. O(N^2) for the naive transform
2D DCT
Treat each row of the signal as a 1D signal and perform a 1D transform
Treat each column of the transformed signal as a 1D signal and perform another 1D transform
Separable transformation: 2nk operations vs. nk^2
3D extension?
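A minimal sketch of the separable 2D DCT described above, using the naive un-normalized DCT-II (O(N^2) per row/column; a real codec would use a fast, properly scaled variant):

```python
import math

def dct1d(x):
    """Naive un-normalized DCT-II: X[k] = sum_n x[n] * cos(pi*(2n+1)*k / (2N))."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N))
            for k in range(N)]

def dct2d(block):
    """Separable 2D DCT: 1D transform on each row, then on each column."""
    rows = [dct1d(row) for row in block]          # horizontal pass
    cols = [dct1d(col) for col in zip(*rows)]     # vertical pass on the transpose
    return [list(r) for r in zip(*cols)]          # transpose back

flat = [[8.0] * 4 for _ in range(4)]   # constant block: no spatial variation
out = dct2d(flat)
print(out[0][0])   # → 128.0; all energy lands in the DC coefficient
```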
Transform coding
The DCT itself does not perform any compression
Images concentrate most of their information in low-frequency components
High-frequency components can be stored with less precision – the human visual system is less sensitive to them
Often high-frequency components round to zero, and the loss of information is not noticeable

Global transform
The DCT acts on an entire signal
So perform it on image blocks instead
One value per frequency for an entire block
Block Artefacts
Image discontinuities: sharp edges dividing otherwise relatively low-frequency areas
High-frequency components localized to a small number of pixels
The DCT is less effective at representing these compactly
Discrete Wavelet Transform (DWT)
Decomposition into two signals, each with half the resolution of the input
Approximation signal: low-res version of the original; contains only low frequencies
Detail signal: information lost by reducing the resolution; contains only high frequencies
Discrete Wavelet Transform (DWT)
The approximation signal is recursively transformed
The image is entirely converted to detail signals of various resolutions
The final result is effectively a sum of scaled and translated versions of a wavelet (a small portion of a wave)
Wavelets have location, waves have phase
Avoids undershoot and ringing
The 2D DWT is often separable (though this depends on the wavelet) – square decomposition
The Haar Wavelet
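One Haar analysis level can be sketched with the average/half-difference convention (the orthonormal Haar wavelet scales both outputs by sqrt(2) instead):

```python
def haar_step(signal):
    """One Haar DWT level: pairwise averages form the approximation signal,
    pairwise half-differences form the detail signal; each is half the
    length of the input."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

a, d = haar_step([9, 7, 3, 5])
print(a)   # [8.0, 4.0] -- low-res version of the original
print(d)   # [1.0, -1.0] -- information lost by halving the resolution
```

Recursing on `a` (as the slides describe) produces the full multi-level decomposition.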
More complicated wavelets
Locality
The detail signal is not transformed further
Despite being high frequency, discontinuities remain localized
Can be less effective for periodic signals, but better for images
Motion compensation
Calculate the motion direction of parts of an image
Temporal coherence: similarity between neighboring video frames
Global – describes the motion of the camera
Local – describes the motion of small objects (within a block of the image)
Motion compensation => a next-frame prediction
The residue (difference from the prediction) is stored
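A toy 1D sketch of this idea (hypothetical sample values; real codecs search 2D blocks with sub-pixel precision): pick the shift of the previous frame that minimizes the residue, then store only the motion vector plus the residue.

```python
def predict_and_residue(prev, curr, dx):
    """Predict `curr` by circularly shifting `prev` right by `dx` samples,
    then compute the residue (difference from the prediction)."""
    prediction = prev[-dx:] + prev[:-dx] if dx else prev[:]
    residue = [c - p for c, p in zip(curr, prediction)]
    return prediction, residue

prev = [0, 0, 9, 9, 0, 0]
curr = [0, 0, 0, 9, 9, 0]   # the "object" moved one sample to the right

# Exhaustive search for the motion vector minimizing the residue energy.
best = min(range(len(prev)),
           key=lambda dx: sum(abs(r) for r in predict_and_residue(prev, curr, dx)[1]))
print(best)                                       # → 1
print(predict_and_residue(prev, curr, best)[1])   # → all-zero residue: cheap to store
```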
Accelerating Wavelet-Based Video Coding on Graphics Hardware using CUDA
Wladimir J. van der Laan, Jos B.T.M. Roerdink, Andrei C. Jalba
Dirac Wavelet Video Codec (DWVC)
Video compression format
Open-source, royalty-free alternative to H.264, with roughly equivalent quality
Developed by BBC Research
Dirac-research – the reference implementation
Schrödinger – high-performance, heavily optimized implementation; a good basis for performance comparison
DWVC Decoding
Stream data:
Intra-frames – self-contained images
Inter-frames – difference with respect to one or two reference frames
Arithmetic decoding – lossless; extracts parameters, vectors, and coefficients from the bitstream
Reverses the entropy coder, which represents common values with shorter bit sequences
Little inherent parallelism – handled by the CPU
Motion compensation – the residue (difference from the prediction) is stored as wavelet coefficients
CUDA Implementation
Use CUDA to avoid mapping the decoding process onto the rendering pipeline
Lifting scheme – less arithmetic, in-place
Frame arithmetic – 16 vs. 32 bit?
Sub-pixel precision: bicubic interpolation of the reference frame
Separable transformation for wavelet lifting: decompose the 2D operation into two 1D operations
Horizontal Pass
Coalesced read of part of a row
Duplicate border elements – boundary conditions
Shared memory: in-place lifting
__syncthreads() after each step of the transform
Coalesced write back to global memory
Reorganized coefficients – based on JPEG 2000 cache-efficient wavelet lifting
Vertical Pass
Substituting rows for columns -> poor coalescing
Each block processes multiple columns: a slab
Each row in a slab can be read with coalescing
Shared memory: transform on columns
Sliding window – not all columns can fit in shared memory
Motion compensation: block placement
Traditional: divide the image into equally sized, disjoint blocks
Strong discontinuities between neighboring blocks; poor prediction at block edges
Overlapped Block Motion Compensation: overlaps neighboring blocks, blending them together in the shared area

Reference frame options:
The previous frame
The previous and next frames (blended together with some weights) – for fades
A different frame several frames back – if it is a better match
Overlapped blocks
Each pixel is part of up to four motion compensation blocks per frame
Naïve implementation: equally sized CUDA blocks
Complicated flow control – neighboring pixels access different motion compensation blocks
Solution: divide the image into regions based on the number and orientation of overlapping blocks
Center – 1 block
Edges – 2 blocks (horizontal or vertical overlap), linear blend
Corners – 4 blocks, bilinear blend
All pixels in a region run the same code; each region is processed by one CUDA block, so there is no divergent branching within a block
Texture memory is faster than constant memory here – each thread potentially accesses a different location
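The corner case above can be sketched as a bilinear blend of four block predictions (the weights and prediction values here are hypothetical; Dirac's actual OBMC weight matrices differ):

```python
def blend_predictions(preds, weights):
    """OBMC blending: a pixel's final prediction is a weighted sum of the
    predictions from every motion-compensation block covering it.
    The per-pixel weights must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(p * w for p, w in zip(preds, weights))

# A corner pixel covered by 4 overlapping blocks; wx/wy are the linear
# ramp weights toward "its own" block along each axis (hypothetical values).
wx, wy = 0.75, 0.75
weights = [wx * wy, (1 - wx) * wy, wx * (1 - wy), (1 - wx) * (1 - wy)]
print(blend_predictions([100, 120, 90, 110], weights))   # → 102.5
```

Edge pixels are the same computation with only two blocks and a single linear ramp, and center pixels degenerate to one block with weight 1.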
Results
Dual-core AMD Opteron 280 vs. Nvidia GeForce GTX 280, CUDA 2.2
CPU reference is single-threaded; GPU times do not include readback (video is displayed through OpenGL textures)
5.4x overall speedup for the entire decode process
13x speedup for GPU operations (arithmetic decoding excluded)
1920x1080 (1080p) displayed at 56.4 fps, vs. 10.5 fps for the CPU reference
25 fps is needed for movie playback
Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting
Christian Tenllado, Javier Setoain, Manuel Prieto, Luis Piñuel, Francisco Tirado
Focus on the DWT
Has other image processing / computer graphics applications – multiresolution analysis
Primary methods: filter bank, lifting scheme
Filter bank
Given a signal A:
Run a low-pass filter (convolution) on A to get the low-frequency approximation (~blur)
Run the corresponding high-pass filter on A to get the high-frequency details
Downsample both by 2 (since we now have twice as much information as necessary)
Recurse on the approximation
A direct translation of the definition of the wavelet transform
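These steps can be sketched as follows, using un-normalized Haar filters for brevity (circular boundary handling here; a real codec would use symmetric extension):

```python
def fb_dwt_step(x, lo, hi):
    """One filter-bank DWT level: convolve with the low- and high-pass
    filters, then keep every other output sample (downsample by 2)."""
    N = len(x)
    def conv(taps):
        # circular convolution: y[n] = sum_k taps[k] * x[n - k]
        return [sum(taps[k] * x[(n - k) % N] for k in range(len(taps)))
                for n in range(N)]
    # keep odd-indexed outputs so each kept sample covers the pair (x[2i], x[2i+1])
    approx = conv(lo)[1::2]
    detail = conv(hi)[1::2]
    return approx, detail

# Haar analysis filters: averaging (low-pass) and half-differencing (high-pass)
a, d = fb_dwt_step([9, 7, 3, 5], lo=[0.5, 0.5], hi=[-0.5, 0.5])
print(a, d)   # → [8.0, 4.0] [1.0, -1.0]
```

Recursing on `a` with the same filters gives the full decomposition.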
Lifting scheme
Combines the high-pass and low-pass filters
Any FBS wavelet can be factorized into several lifting steps using the polyphase matrix representation
Split the signal into odd/even samples (the lazy wavelet transform)
Predict the odd samples from the even ones
Update the even samples to preserve properties of the signal (e.g. its mean)
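The split/predict/update steps can be sketched for a Haar-style lifting factorization (here detail = odd − even and approximation = pairwise mean; conventions and scaling vary between codecs). Inversion just runs the same steps backwards with the signs flipped:

```python
def haar_lifting(x):
    """Haar DWT via lifting: no convolution, in-place friendly."""
    even, odd = x[0::2], x[1::2]                       # split (lazy wavelet transform)
    detail = [o - e for e, o in zip(even, odd)]        # predict: odd ≈ even
    approx = [e + d / 2 for e, d in zip(even, detail)] # update: preserve the mean
    return approx, detail

def haar_unlifting(approx, detail):
    """Inverse: undo update, undo predict, merge."""
    even = [a - d / 2 for a, d in zip(approx, detail)]
    odd = [e + d for e, d in zip(even, detail)]
    return [v for pair in zip(even, odd) for v in pair]

a, d = haar_lifting([9, 7, 3, 5])
print(a, d)                  # → [8.0, 4.0] [-2, 2]
print(haar_unlifting(a, d))  # → [9.0, 7.0, 3.0, 5.0]: perfect reconstruction
```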
LS Advantages
Simple to invert: run the steps in the opposite direction (no deconvolution required)
A method for producing wavelet transforms with control over the actual operations executed
Can use integer operations -> lossless compression
Easy to generalize: lifting steps must be invertible, but do not have to be linear
Tends to be more efficient w.r.t. hardware or power consumption for embedded systems
FBS vs. LS: Speed
CPU: LS is up to twice the speed of FBS
Performs about half as many computations, though actual gains are often smaller than theoretical
In-place transform
LS is the default way to implement the wavelet transform – seen as the most efficient
GPU: FBS is actually faster – fewer synchronization barriers
Implementation
OpenGL + Cg
Layout: 2x2 blocks stored in an RGBA texel – allows the horizontal and vertical algorithms to be designed symmetrically
Filter bank – a synchronization barrier between the horizontal and vertical filters
Lifting scheme: several loops performing simple vector operations on each data stream
Every LS step is performed by a different kernel – many synchronization barriers
Results
Execution times scale linearly with problem size
The ratio of LS time to FBS time -> constant as size grows
Speedups from the Nvidia FX 5950 Ultra (2003) to the 7800 GTX (2005): 4x for FBS, 2.2x for LS
Results
The key performance factor is the number of rendering passes and synchronization barriers
FBS doesn't require a pipeline flush, allowing better parallelization
LS: removing synchronization barriers (incorrect output, but a good performance estimate) gives a 1.4x speedup
GPU: 1.2–3.4x speedup over the CPU implementation, excluding data transfer
Transforms a 4-megapixel image in 9.12 ms (FBS) and 17.9 ms (LS) with Daubechies-4; slower times for more complicated wavelets
Future improvements
The LS/FBS time ratio grows as the number of shader processors increases – future GPUs will progressively favor FBS
Waiting for better CPU/GPU integration
Suggest fusing consecutive kernels – increased complexity, but faster
Summary
The GPU allows a several-times speedup over the CPU for decompression with modern codecs
May not seem dramatic, but it helps cross the barrier of movie frame rates
Allows more types of compression algorithms to become feasible
Implementation methods that are best for the CPU may not be best for the GPU