Scalable Media Coding Applications and New Directions

UNSW – EE&T

Scalable Media CodingApplications and New Directions

David Taubman

School of Electrical Engineering & TelecommunicationsUNSW Australia

UNSW – EE&T

Taubman; PCS’13, San Jose 2

Overview of Talk• Backdrop and Perspective

– interactive remote browsing of media– JPEG2000, JPIP and emerging applications

• Approaches to scalable video– fully embedded approaches based on wavelet lifting– partially embedded approaches based on inter-layer prediction– important things to appreciate

• Key challenges for scalable video and other media– focus on motion, depth and natural models

• Early work on motion for scalable coding– focus: reduce artificial boundaries

• Current focus: everything is imagery– depth, motion, boundary geometry– scalable compression of breakpoints (sparse innovations)– estimation algorithms for joint discovery of depth, motion and geometry– temporal flows based on breakpoints

• Summary

UNSW – EE&T

Emerging Trends• Video formats

– QCIF (25 Kpel), CIF (100 Kpel),4CIF/SDTV (½ Mpel), HDTV (2 Mpel)UHDTV 4K (10 Mpel) and 8K (32 Mpel)

– Cinema: 24/48/60 fps; UHDTV: potentially up to 120 fps– HFR: helmet cams & professional cameras now do 240fps

• Displays– “retina” resolutions (200 to 400 pixels/inch)– what resolution video do I need for an iPad? (2048x1536?)

• Internet and mobile devices – vast majority of internet traffic is video– 100’s of separate non-scalable encodings of a video

• http://gigaom.com/2012/12/18/netflix-encoding

• New media: multi-view video, depth, …


UNSW – EE&T

Embedded media• One code-stream, many subsets of interest

4Taubman; PCS’13, San Jose

Low res

Medium res

High res

Compressed bit-streamLow quality

Medium quality

High quality

Low frame rate

Medium frame rate

High frame rate

• Heterogeneous clients (classic application)– match client bandwidth, display, computation, ….

• Graceful degradation (another classic)– more important subsets protected more heavily, …

• Robust offloading of sensitive media– backup most important elements first

UNSW – EE&T

Our focus: Interactive Browsing• Image quality progressively improves over time

• Video quality improves each time we go back

• Region (window) of interest accessibility– in space, in time, across views, …

– can yield enormous effective compression gains

– prioritized streaming based on “degree of relevance”• some elements contribute only partially to the window of interest

• Related application: Early retrieval of surveillance– can’t send everything over the air

– can’t wait until the plane lands

Taubman: PCS’13, San Jose 5

UNSW – EE&T

Scalable images – things that work well• Multi-resolution transforms

– 2D wavelet transforms work well

6

• Accessibility through partitioned coding of subbands– Region of interest access without any blocking artefacts

ECDZQ R-D curve(step size modulation)

)1(1I )1(

2I )1(3I)1(

1I

)2(0I )2(

1I

2

4

)2/1(2

)4/1(4

)1(0I

Bit-plane coding(truncation)

L

D2 coding passes

per bit-plane

• Embedded coding– Successive refinement through bit-plane coding– Multiple coding passes/bit-plane improve embedding

Taubman; PCS’13, San Jose

UNSW – EE&T

7

JPEG2000 – more than compression

Decoupling and embedding

embeddedembedded

code-block code-block

bit-streams bit-streams

embeddedembedded

code-block code-block

bit-streams bit-streams

LLLL22

LHLH22HHHH22

HLHL22

HLHL11

HHHH11LHLH11


UNSW – EE&T

8


Spatial random access


UNSW – EE&T

9


Quality and resolution scalability

LLLL22

LHLH22 HHHH22

HLHL22

HLHL11

HHHH11LHLH11

layer 1layer 1layer 2layer 2layer 3layer 3

quality layers


UNSW – EE&T

10

JPEG2000 – JPIP interactivity (IS15444-9)

• Client sends “window requests”– spatial region, resolution, components, …

• Server sends “JPIP stream” messages– self-describing, arbitrarily ordered– pre-emptable, server optimized data stream

• Server typically models client cache– avoids redundant transmission

• JPIP also does metadata– scalability & accessibility for text, regions, XML, …

Cache Model

imagerywindow request

JPIP Server JPIP Client

Target(file or code-stream) Decompress/render

ApplicationJPIP stream + response headers

Client Cache

window

window

status


UNSW – EE&T

11

What can you do with JPIP?

• Highly efficient interactive navigation within– large images (giga-pixel, even tera-pixel)– medical volumes– virtual microscopy– window of interest access, progressive to lossless– interactive metadata

• Interactive video– frame of interest– region of interest– frame rate and resolution of interest– quality improves each time we go back over content

Panoramic Video Demo

Aerial Demo

Album Demo

Catscan Demo

Campus Demo


UNSW – EE&T








• Summary

UNSW – EE&T

Wavelet-like approaches

• Lifting factorization exists for any FIR temporal wavelet transform

• Warp frames prior to lifting filters, j

– motion compensation aligns frame features– invertibility unaffected (any motion model)

* * 11

** 22

** L-1L-1

** LL

Even framesEven frames

Odd framesOdd frames

Low-pass framesLow-pass frames

High-pass framesHigh-pass frames

warpwarpframeframe

ss

warpwarpframesframes

warpwarpframeframe

ss

warpwarpframesframes

Motion Compensated Temporal Lifting


UNSW – EE&T

Motion Compensated Temporal Lifting

• Equiv. to filtering along motion trajectories– subject to good MC interpolation filters– subject to existence of 2D motion trajectories

• Handles expansive/contractive motion flows– motion trajectory filtering interpretation still valid for all but

highest spatial freqs– subject to disciplined warping to minimize aliasing effects

• Need good motion!– smooth, bijective mappings wherever possible– avoid unnecessary motion discontinuities (e.g., blocks)


UNSW – EE&T

t+2D Structure with MC Lifting

MCLifting

MCLifting

SpatialDWT

SpatialDWT

SpatialDWT

SpatialDWT

SpatialDWT

SpatialDWT

MCLifting

MCLifting

t+2D Analysis t+2D Synthesis

H1

H2

L2

H2

L2

H1

Spatio-Temporal Subbands

input frames

Spatial resolutionSpatial resolution

Qualit

yQ

ualit

yTemporal

Temporal

resolution

resolution

Dimensions of Scalability:Dimensions of Scalability:

NB: other interestingNB: other interestingstructures existstructures exist


UNSW – EE&T

Inter-layer prediction (SVC/SHEVC)


0011

2233

44

00

22

44

00

44

0011

2233

44

00

22

44

00

44High Resolution

Layer

Low ResolutionLayer

Multiple predictors• pick one?• blend?

Implications for embedding …

UNSW – EE&T

Important Principles• Scalability is a multi-dimensional phenomenon

– not just a sequential list of layers

• Quality scalability critical for embedded schemes– low res subset needs much higher quality at high res

• Physical motion scales naturally– not generally true for block-based approximations– but, 2D motion discontinuous at boundaries

• Prediction alone is sub-optimal– full spatio-temporal transforms preferred

• improve the quality of embedded resolutions and frame rates• noise orthogonalisation

– ideally orthogonal transforms• note body of work by Flierl et al. (MMSP’06 thru PCS’13)


UNSW – EE&T

18

00

1

2/)(ˆ0 g)(ˆ1 g

Temporal transforms: Why prediction alone is sub-optimal

2

1

2

1

12 kf

kf2 22 kfevenframes

oddframes

residual

forward transform

2

1

2

1

12 kf

kf2 22 kf

reverse transform

12 ky 12 ky

2qL

2qH

quantization

1

-½-½

1

0H

1H

0G

1G

2qL

2qH

2

2

2

2

kf

1

½½ 1

Redundant spanningof low-pass content byboth channels

High-pass quantizationnoise has unnecessarilyhigh energy gain.

Bi-directionalprediction

Taubman: PCS’13, San Jose

UNSW – EE&T

19

Reduced noise power through lifting

• Inject –ve fraction of high band into low band synthesis path– removes low freq. noise power from

synthesized high band

• Add compensating step in the forward transform– does not affect energy compacting

properties of prediction

2

1

2

1

12 kf

kf2 22 kfevenframes

oddframes

2

1

2

1

12 kf

kf2 22 kf

12 ky 12 ky

2qL

2qH

22 kfky2

00

1

2/)(ˆ0 g

)(ˆ1 g

12 ky12 ky

4

1

4

1

12 ky12 ky

ky2

4

1

4

1


UNSW – EE&T








• Summary

UNSW – EE&T

Motion for Scalable Video• Fully scalable video requires scalable motion

– reduce motion bit-rate as video quality reduces– reduce motion resolution as video resolution reduces

• First demonstration (Taubman & Secker, 2003)

– 16x16 triangular mesh motion model– Wavelet transform of mesh node vectors– EBCOT coding of mesh subbands– Model-based allocation of motion bits to quality layers– Pure t+2D motion-compensated temporal lifting


UNSW – EE&T

Scalable motion – very early results

22

0 200 400 600 800 1000 120018

20

22

24

26

28

30

32

0 200 400 600 800 1000 120018

20

22

24

26

28

30

32

34

Lum

inan

ce P

SN

R (

dB)

Lum

inan

ce P

SN

R (

dB)

Lum

inan

ce P

SN

R (

dB)

Lum

inan

ce P

SN

R (

dB)

Bit-Rate (kbit/s)Bit-Rate (kbit/s)

Non-scalableBrute-force

Model-basedLossless motion

Non-scalableBrute-force

Model-basedLossless motion

CIF Bus CIF Bus at QCIF resolution

Bit-Rate (kbit/s)Bit-Rate (kbit/s)

2.01.81.61.41.21.00.80.60.40.2

36

34

32

30

28

26

24

22

20

38

CIF mobile

Lum

inan

ce P

SN

R (

dB)

Lum

inan

ce P

SN

R (

dB)

Bit-Rate (Mbit/s)Bit-Rate (Mbit/s)

5/3 temporal lifting

1/3 (hierarchical B-frames)

H.264 high complexity H.264 results

• CABAC• 5 prev, 3 future ref frames• multi-hypothesis testing• (courtesy of Marcus Flierl)


UNSW – EE&T

Motion challenges• Issues:

– smooth motion fields scale well• mesh is guaranteed to be smooth and invertible everywhere

– but, real motion fields have discontinuities

• Hierarchical block-based schemes– produce a massive number of artificial discontinuities– not invertible – i.e., there are no motion trajectories– non-physical – hence, not easy to scale– but, easy to optimize for energy compaction

• particularly effective at lower bit-rates

• Depth/disparity has all the same issues as motion– both tend to be piecewise smooth media– NB: bandlimited sampling considerations may not apply

• boundary discontinuity modeling more important for these media


UNSW – EE&T

24

Aliasing challenges

Analysis filter responses of thepopular 9/7 wavelet transform

1)(ˆ)(ˆ)(ˆ)(ˆ1010 hhhh

Fundamental constraint:(for perfect reconstruction)

half-band filter

00

1

2/

)(0̂ h )(1̂ h

)(1̂ h

Extract LLsubband

Spatial aliasing


UNSW – EE&T

Spatial scalability – t+2D

• Temporal transform uses full spatial resolution

25

temporalpredict

temporalupdate

• At reduced spatial resolution– Temporal synthesis steps missing high-resolution info

– If motion trajectories wrong/non-physical ghosting

– If trajectories valid temporal synthesis reduces aliasing• less aliasing than regular spatial DWT at reduced resolution


UNSW – EE&T







• Current focus: everything is imagery– depth, motion, boundary geometry– scalable compression of breakpoints (sparse innovations)– estimation algorithms for joint discovery of depth, motion and geometry– high level view of the coding framework we are pursuing

• Summary

UNSW – EE&T

Block-based schemes with merging

• Linear & affine models– encourages larger blocks

• Merging of quad-tree nodes– encourages larger regions and improves efficiency– merging approach later picked up by the HEVC standard

• Hierarchical coding– works very well with merging; provides resolution scalability

27

Affine motion model

Linear motion model

Pure translation

leafmerging

(Mathew and Taubman, 2006)


UNSW – EE&T

Boundary geometry and merging• Model motion & boundary• No merging

– Hung et al. (2006)– Escoda et al. (2007)

• With merging– Mathew & Taubman (2007)– separate quad-trees (2008)

28

M2M1

Motion Comp (M2)

Motion Comp (M1)

Combined result


Motion only Motion + boundary Separate quad-trees

UNSW – EE&T

Indicative Performance

• Things that reduce artificial discontinuities:– modeling geometry as well as motion– separately pruned trees for geometry and motion– merging nodes from the pruned quad-trees

• These schemes are practical and resolution scalable– readily optimized across the hierarchy

29

Two quad-trees:motion, geometry,

merging

Single quad-tree:motion, geometry,

merging

Single quad-tree:motion only

Single quad-tree:motion + merging


UNSW – EE&T







• Current focus: everything is imagery– depth, motion, boundary geometry– scalable compression of breakpoints (sparse innovations)– estimation algorithms for joint discovery of depth, motion and geometry– temporal flows induced by breakpoints

• Summary

UNSW – EE&T

Geometry from arc breakpoints

• Breaks can represent segmentation contours– but don’t need a segmentation– breaks are all we need for adaptive transforms– direct RD optimisation is possible

• Natural resolution scalability– finer resolution arcs embedded in coarser arcs


Coarse grid Finer grid

ArcsGridpoints

UNSW – EE&T

Finer grid

Breakpoint induction and scalability

• Explicitly identify subset of breakpoints – “vertices”– remaining breakpoints get induced– position of break on its arc impacts induction

• Representation is fully scalable– resolution improves as we add vertices at finer scales– quality improves as we add precision to breakpoint positions

• Breakpoint representation is image-like– vertex density closely related to image resolution– vertex accuracy like sample accuracy in an image hierarchy


Coarse grid

direct inductiononto sub-arcs

spatial inductiononto root-arcs

spatial inductiononto root-arcs

UNSW – EE&T

33

Breakpoints drive an adaptive DWT

Basis functions do not cross discontinuity along an arc

Max of one breakpoint per arc Adaptive transform well defined

Arc BreakpointPyramid

Field SamplePyramid

Original field samples

Breakpoint Adaptive Transforms (BPA-DWT) – sequence of non-separable 2D lifting steps

P1-step

U1-step

P2-step

U2-step


UNSW – EE&T

root arcs

Arc-bands and sub-bands

34

Sub-bands (BPA-transformed data)Arc-bands (vertices)

Level 2

Level 1

Level 0

codeblocksnon-root

arcs


UNSW – EE&T

Embedded Block Coding– for scalability and ROI accessibility

• Sub-band stream (field samples)– Sub-bands divided into code blocks– Coded using EBCOT (JPEG2000)– Bitplanes assigned to quality layers

• Vertex Stream– Arc-bands divided into code blocks– Coding scheme similar to EBCOT– Bitplanes refine vertex locations– Bitplanes assigned to quality layers

1

0

1

1

0

0

X

1

0

1

0

0

0

X

0

0

1

0

0

0

X

1

0

0

0

0

0

X

LSB

λ1

λ2

λ3

Vertex stream

Sub-band stream


UNSW – EE&T

Fully Scalable Depth Coding– field samples are depth; breaks are depth discontinuities

36

JPEG 2000, 50 k bits• Resolution scalable• Quality scalable• No blocks

Proposed, 50 k bits• Resolution scalable• Quality scalable• No blocks

Poorly suited todiscontinuities in

depth/motion fields

Well suited todiscontinuous

depth/motion fields


UNSW – EE&T

R-D Results for Depth Coding


Breakpoint-adaptive: vertex stream is truncated at a quality level sub-band stream decoded

progressively

UNSW – EE&T

Model-based quality layering

38

• Scaled by discarding sub-band and arc-band quality layers– fully automatic model-based quality layer formation– model-based interleaving of all quality layers for optimal embedding


UNSW – EE&T

Model-based quality layering

39

• Compared with segmentation based approach(Zanuttigh & Cortelazzo, 2009)– not scalable; sensitive to initial choice of segmentation complexity


UNSW – EE&T

Fully scalable motion coding – preliminary

• Field samples are motion vectors– motion coded using EBCOT after BPA-DWT

• Breakpoints and motion jointly estimated– compression-regularised optical flow:

– Coded length L provides the prior (regularisation)• reflects cost of scalably coding arc breakpoints• plus cost of coding motion wavelet coeffs after BPA-DWT

– Distortion D provides the observation model


UNSW – EE&T

Optimisation framework• Approach based on loopy belief propagation• Breakpoints/vertices connected via inducing rules

– turns out to be very efficient• complexity grows only linearly with image size

• Breakpoints/motion connected via BPA-DWT– initial simplification: motion at each pixel location drawn

from a small finite set (cardinality 5)– initial motion candidates generated by block search

with varying block sizes

• System always converges rapidly• Use modes of the marginal beliefs at each node


UNSW – EE&T

Initial scalable motion results

• 2 frames, inter-predict only; occlusions removed– newer work eliminates restriction to finite motion set


JPEG2000

H.264

trunc motion bits

trunc vertexand motion bits

UNSW – EE&T

Visualising the breaks & motion


UNSW – EE&T

Temporal induction of motion – preliminary

• Given:– motion from f1 to f2

– plus breakpoint fields in f1 and f2

• Induce:– motion from f2 to f1

– resolve ambiguities, find occlusions


f1 f2

UNSW – EE&T

Synthetic example:


Inferred horizontal motion

Inferred vertical motion

Inferred occlusion

Resolvedambiguities

UNSW – EE&T

Temporal induction of breaks – preliminary

• Given:– motion from frame f1 to frame f3

– plus breakpoint fields in f1 and f3

– plus motion from f1 to f2 (real or estimated)

• Induce:– breakpoint field in frame f2

– hence estimate all other motion fields– note: induced occlusions are temporal breaks


f1 f2 f3

UNSW – EE&T

Synthetic example:


Horizontal breaksFrame 1Horizontal breaksFrame 2 inducedHorizontal breaks

Frame 3

UNSW – EE&T

Things we are working on• Message approximation strategies for joint inference

of breakpoint fields with motion– compression regularized optical flow– continuous motion, notionally at infinite precision

• Advanced breakpoint coding techniques– including differential coding schemes

• i.e., geometry transforms

– including inter-block conditional coding schemes that do not break the EBCOT paradigm

• Breakpoint adaptive spatio-temporal transforms– breakpoints potentially address most of the open issues

surrounding motion compensated lifting schemes

• Integration within our JPEG2000 SDK (Kakadu)– interactive access, JPIP for remote browsing, etc.


UNSW – EE&T

Demo placeholder• Interactive browsing of breakpoint media over JPIP


UNSW – EE&T

Arc breakpoints vs graph edges• arcs = possible edges on spatial graph

– breaks = missing edges– main point of divergence

• arc breakpoints have locations (key to induction & scalability)• graph edges may be weighted (equivalent to “soft” breaks)

• Related approaches: “Graph based transforms for depth video coding”

Kim, Narang and Ortega, ICASSP’12 “Video coder based on lifting transforms on graphs”

Martinez-Enriquez, Diaz-de-Maria & Ortega, ICIP’11– Edges (binary breaks) from segmentation (JBIG encoded)– Motion (block based) produces temporal edges– Spatio-temporal video transform adapts to graph– hierarchical, but not scalable in the usual sense


UNSW – EE&T

Summary• Interactive image browsing benefits from

– embedded transforms and embedded block coding– intelligent servers send only a fraction of total content– getting only what you want ~ huge effective coding gain

• Challenges for interactive video– fully embedded motion fields– bijective motion fields wherever this is physical

• Key: fully scalable model of motion discontinuities– allows motion to be invertible almost everywhere– ideal for motion compensated 3D wavelet xforms– allows motion to scale naturally in space and time

• Numerous research challenges …


UNSW – EE&T

Important Acknowledgements• Reji Mathew

– scalable motion, scalable depth & breakpoint estimation

• Sean Young– scalable motion coding results

• Dominic Rüfenacht– temporal induction of motion and breakpoints

• Pietro Zanuttigh– breakpoint estimation for depth coding

• Australian Research Council– much of the presented research was partially funded

by the ARC Discovery Projects scheme


Documents

Scalable Media Coding Applications and New Directions