

Texture Synthesis and Image Inpainting for Video Compression

Project Report for CSE-252C

Sanjeev Kumar

Abstract

In this report we investigate the application of texture synthesis and image inpainting techniques to video compression. Working in the non-parametric framework, we use 3D patches for matching and copying. This ensures temporal continuity to an extent that is not possible to obtain by working with individual frames. Since, in the present application, patches might contain arbitrarily shaped and multiple disconnected holes, the fast Fourier transform (FFT) and summed-area-table based sum of squared differences (SSD) calculation of [8] cannot be used directly. We propose a modification of that scheme which allows its use in the present application. This results in a significant gain in efficiency, since the search space is typically huge for video applications.

1. Introduction

Any semblance of order in an observation is a manifestation of redundancy in its present representation. Such redundancies have been exploited for different purposes in innumerable applications, one canonical example being data compression. Specifically, in video compression, motion compensation and transform coding are used to exploit the spatio-temporal redundancy in the video signal. But the types of redundancy exploited by current methods are rather limited, as exemplified by the success of error concealment techniques. Moreover, the discrete cosine transform (DCT), the most commonly used transform coding technique in video compression, has been found to be effective only for small block sizes, which shows its inability to exploit redundancies extending over larger scales. Wavelets have been relatively more successful in this respect for certain classes of images but haven't found much use in general-purpose video compression.

It is interesting to compare the aforementioned techniques with texture synthesis and image inpainting, both of which also assume the existence of redundancies in images and video, but exploit them for purposes other than compression. The type of redundancy exploited by these techniques is somewhat complementary to that exploited by the DCT: texture synthesis, for example, works at a more global level. The motivation behind the present work was the perceptual quality of the output of texture synthesis and image inpainting methods using a small amount of original data. Although building a practical video codec based on these principles would require more research effort, the present work is intended to be a small step in that direction.

We build upon the work of [3], which uses a non-parametric texture synthesis approach for image inpainting with an appropriate choice of fill order. The contribution of the present work is twofold. First, we extend the work of [3] to three dimensions, which, we argue, is a more natural setting for video. Second, we extend the work of [8], which allows FFT-based SSD calculation for all possible translations of a rectangular patch, to arbitrarily shaped regions, thus making it applicable to image inpainting.

This report is organised as follows. In Section 2 we briefly review the relevant literature. Section 3 describes the proposed approach. Experimental results are presented in Section 4. Some future research directions are discussed in Section 5.

2. Previous work

Related prior work can be broadly classified into the following three categories.

2.1 Texture Synthesis

Texture synthesis has been used in the literature to fill large image regions with a texture pattern similar to a given sample. Methods used for this purpose range from parametric, which estimate a parametrized model for the texture and use it for synthesis, e.g. Heeger et al. [2], to non-parametric, in which synthesis is based on direct sampling of the supplied texture pattern, e.g. Efros and Leung [1]. Texture synthesis methods have also been used to fill in small holes in an image, which might originate from deterioration of the image or from the desire to remove some objects. But the aforementioned texture synthesis methods have been found to work poorly for highly structured textures, which are common in natural images. Graph-cut based techniques [7] try to maintain structural continuity during texture synthesis by finding an optimal seam while copying texture patches from different portions of the image.


2.2 Image Inpainting

Image inpainting has been used in the literature to fill in small holes in an image by propagating structure information from the image into the region to be filled. These methods typically use diffusion based on partial differential equations to propagate linear structure [5] [9]. They are found to perform well in filling small holes but produce noticeable blur when filling large holes. Recently, Criminisi et al. [3] proposed a method which combines the benefits provided by texture synthesis and image inpainting. The results of their algorithm compare favorably with the state of the art in the field without resorting to texture segmentation or explicit structure propagation, and the method is able to fill in large holes. We primarily build upon their work and investigate some issues which are not so important for static images but become relevant for video.

2.3 Applications in Video

The application of texture synthesis and image inpainting to video has been investigated by Bornard et al. [6]. While searching for the matching neighborhood, their algorithm also searches in the previous and next frames apart from the current frame. Although this provides robustness against error, it doesn't try to maintain temporal continuity, which is very important for video applications.

3. Proposed approach

In this section, we begin with a brief recapitulation of the notation and algorithm presented in [3]. For more details, the reader should refer to the original manuscript.

3.1 Problem Statement and Some Notations

We are given an image or video I = Φ ∪ Ω, where Φ is the source region, the set of pixels whose values are known, and Ω is the target region (aka the hole), the set of pixels whose values are to be filled. δΩ represents the boundary between the source and target regions, which is a contour in 2D and a surface in 3D. For every boundary pixel p ∈ δΩ, ψp denotes a patch centered at p.
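To make the notation concrete, here is a small illustrative sketch (ours, not part of [3]): the target region Ω is held as a set of pixel coordinates, and the boundary δΩ is recovered as those target pixels that have at least one known 4-neighbor.

```python
# Hypothetical helper (not from the original paper): extract the fill
# front dOmega from a target region Omega stored as a set of (x, y) pixels.

def fill_front(omega, width, height):
    """Pixels of omega with at least one known (source) 4-neighbor."""
    front = set()
    for (x, y) in omega:
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in omega:
                front.add((x, y))
                break
    return front

# A 3x3 hole in a 5x5 image: the center pixel is interior, the rest lie on dOmega.
omega = {(x, y) for x in range(1, 4) for y in range(1, 4)}
print(sorted(fill_front(omega, 5, 5)))   # eight boundary pixels, (2, 2) excluded
```

The same set-based representation extends directly to 3D by adding a frame index to each coordinate.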

3.2 Inpainting in 2D

We assign a confidence term to every pixel on the boundary, which is given by

    C(p) = ( Σ_{q ∈ ψp ∩ (I−Ω)} C(q) ) / |ψp|    (1)

Initially, C(p) = 0 ∀p ∈ Ω and C(p) = 1 ∀p ∈ I − Ω. We also assign a data term to every pixel on the boundary, which is given by

    D(p) = |∇I⊥p · np| / α    (2)

where α is the maximum value of the intensity range used for the image (e.g. 255), ∇Ip represents the gradient of the image, and np represents the normal vector to the region boundary. Based on the data and confidence terms, a priority is assigned to every pixel on the boundary, given by P(p) = C(p) · D(p). The patch centered at the pixel with maximum priority value p = arg max_p P(p) is selected as the starting point for filling. A search is performed over all patches ψq in Φ for the best matching patch according to a modified sum of squared differences criterion d(ψp, ψq), which includes only those pixels of ψp which are already filled in. The authors of [3] performed the search in CIE Lab color space, but in the present work we only deal with grayscale images; all the algorithms presented here can easily be extended to any color space. Values for all pixels r ∈ ψp ∩ Ω are copied from the corresponding pixels in ψq, and their confidence terms are copied from p. The data term is recomputed for pixels on the boundary created by the recent fill, and the above steps are repeated until all the pixels are filled.

The algorithm described so far is exactly the same as that presented in [3]. We now describe the extensions proposed in the present work in the following two subsections.
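As an illustration of the priority computation (a sketch under our own naming, not the authors' implementation): the confidence of Eq. (1) is the summed confidence of the already-filled patch pixels divided by the full patch area |ψp|, and the priority multiplies it by the data term.

```python
# Sketch of Eq. (1) and the priority P(p) = C(p) * D(p); names are ours.

def confidence(p, omega, conf, radius, width, height):
    """Mean confidence over the full patch; unfilled pixels contribute 0."""
    x0, y0 = p
    total, area = 0.0, 0
    for x in range(x0 - radius, x0 + radius + 1):
        for y in range(y0 - radius, y0 + radius + 1):
            area += 1                              # |psi_p|: full patch size
            if 0 <= x < width and 0 <= y < height and (x, y) not in omega:
                total += conf.get((x, y), 1.0)     # C = 1 initially in the source
    return total / area

def priority(p, omega, conf, data_term, radius, width, height):
    return confidence(p, omega, conf, radius, width, height) * data_term(p)

# 3x3 patch centered on a lone hole pixel: 8 of the 9 pixels are filled.
print(confidence((2, 2), {(2, 2)}, {}, 1, 5, 5))   # 8/9 = 0.888...
```

The 3D generalization only changes the patch loop to range over a spatio-temporal neighborhood.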

3.3 Extension to 3D

A naive approach to extending the algorithm described above to video would be to treat the video as a collection of frames and perform the same operation on each frame independently. There are two main disadvantages to this approach. First, because of temporal correlation among the frames, a matching patch is also likely to be found in temporally adjacent frames, especially if a portion of the region to be filled (and hence absent) in one frame is present in some neighboring frame. The other disadvantage is that even if every filled frame is spatially coherent, temporal coherence is not ensured, which might result in visible artifacts in the video. For video texture synthesis, 3D spatio-temporal patches were used in [7]. Similarly, we propose to use 3D spatio-temporal patches in the present framework. This ensures temporal consistency in the video for the same reasons that spatial consistency is maintained in the 2D version of the present algorithm.

In order to extend the present algorithm to 3D, we need generalizations of the confidence term of (1) and the data term of (2). Generalization of the confidence term is trivial, since we now just need to sum over pixels in the 3D patch. Generalization of the data term is not that obvious. Although we could still compute the spatio-temporal gradient of image intensity (∇Ip) and the normal vector (np) to the boundary surface,


there is no unique direction perpendicular to the gradient of image intensity, i.e. ∇I⊥p is not unique. We propose the following modified data term using the vector cross product:

    D(p) = |∇Ip × np| / α    (3)

which is defined in both 2D and 3D, maintains the intuition of propagating structural information, and has the nice property of reducing to (2) in 2D. For calculating the normal vector np to the boundary surface δΩ we use a scheme different from that of [3] which is equally valid in 2D and 3D: we maintain a binary mask representing Ω, whose smoothed gradient gives us the normal vector np to the boundary surface δΩ.
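The modified data term can be sketched as follows (our illustration; the gradient and normal are assumed to be precomputed elsewhere, e.g. by finite differences and the smoothed-mask gradient described above). In 2D, with zero temporal components, |∇Ip × np| equals |∇I⊥p · np|, so (3) indeed reduces to (2).

```python
# Illustration of Eq. (3): D(p) = |grad_I x n| / alpha, with 3-vectors.
# grad_I and n are hypothetical precomputed inputs, not the authors' API.

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def data_term(grad_I, n, alpha=255.0):
    c = cross(grad_I, n)
    return (c[0] ** 2 + c[1] ** 2 + c[2] ** 2) ** 0.5 / alpha

# 2D case (zero temporal components): gradient along x, normal along y.
# Eq. (2) gives |grad_I_perp . n| / alpha = 255 / 255 = 1; Eq. (3) agrees.
print(data_term((255.0, 0.0, 0.0), (0.0, 1.0, 0.0)))   # 1.0
```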

3.4 FFT based search

Searching for the best matching patch is the most computationally demanding part of the present algorithm. The possibility of a real-time implementation critically depends on the efficiency of this search. In this subsection, we revisit the problem of searching for a matching patch, describe an algorithm presented in [8] which solves a closely related problem efficiently but is not directly applicable in the present situation, and propose a modification of that algorithm which makes it applicable here.

We look for the best match for the selected patch Ψp in the search range Φ. Equivalently, we look for translations of the patch such that the portion of the patch overlapping the surrounding region matches it well; only those translations that allow complete overlap of the patch with the surrounding region are considered. The cost of a translation t of the patch is defined as:

    C(t) = Σ_{p ∈ Ψp ∩ (I−Ω)} |ℓ(p−t) − ℓ(p)|²    (4)

where ℓ(p) denotes the intensity or color value at pixel p. The SSD-based search described in Eq. (4) can be computationally expensive if carried out naively: computing the cost C(t) for all valid translations is O(n²), where n is the number of pixels in the image or video. If the domain of summation is a simple rectangular region, the search can be accelerated using the fast Fourier transform (FFT) [8]. We can rewrite (4) as:

    C(t) = Σ_{p ∈ Ψp ∩ (I−Ω)} [ ℓ(p−t)² − 2 ℓ(p−t) ℓ(p) + ℓ(p)² ]    (5)

The third term is independent of t and can be discarded from consideration while minimizing over t. The first term in (5) is the sum of squares of pixel values over the search region around the patch; for a non-masked region it can be computed efficiently in O(n) time using summed-area tables [10]. The second term is a convolution of the patch with the image search region and, for a non-masked region, can be computed in O(n log n) time using the FFT. But the summation domain in (5) is not a simple rectangle; it can contain arbitrarily shaped and disconnected holes, and hence the algorithm presented in [8] is not directly applicable here. We propose the following modification of the aforementioned algorithm to make it applicable in the present situation.

For evaluating the second term, the presence of both filled and unfilled pixels in the patch prohibits a straightforward implementation of the convolution. This problem can be circumvented by requiring that the masked region of the patch not contribute to the convolution sum: in practice, the unfilled pixels are set to zero, and the convolution can then be evaluated as the product of the Fourier transforms of the two terms.

The first term consists of the summation of the squared intensities of those image pixels which correspond to filled pixels of the patch. For different values of the translation t, this summation can be expressed as a convolution of the "squared" image with an appropriate binary mask, and hence can also be calculated efficiently using the FFT.
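The two observations above can be checked with a small 1D demonstration (ours; the data is made up, and in the full algorithm each correlation would be evaluated with an FFT rather than directly): zeroing the unfilled patch pixels and keeping a binary mask of the filled ones splits the masked SSD of Eq. (4) exactly into the terms of Eq. (5).

```python
# Verify, in 1D by direct correlation, that the masked SSD
#   C(t) = sum over filled pixels of (lum[p+t] - patch0[p])^2
# equals corr(lum^2, mask)(t) - 2*corr(lum, patch0)(t) + const, where
# patch0 is the patch with its unfilled pixels set to zero.

def corr(signal, kernel, t):
    """Cross-correlation of signal with kernel at offset t."""
    return sum(signal[t + i] * kernel[i] for i in range(len(kernel)))

lum = [3, 1, 4, 1, 5, 9, 2, 6]    # "image" intensities
patch0 = [4, 0, 5]                # unfilled middle pixel stored as 0
mask = [1, 0, 1]                  # 1 where the patch pixel is filled

const = sum(m * v * v for m, v in zip(mask, patch0))   # t-independent term
for t in range(len(lum) - len(patch0) + 1):
    naive = sum(m * (lum[t + i] - patch0[i]) ** 2
                for i, m in enumerate(mask))
    fast = corr([x * x for x in lum], mask, t) - 2 * corr(lum, patch0, t) + const
    assert naive == fast
print("masked SSD decomposition verified")
```

Because the zeroed pixels kill the cross term and the mask selects the squared-intensity term, both correlations have full rectangular support and are FFT-friendly, which is exactly what makes the modification fit the framework of [8].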

4. Results

Figure 1: Original image used for 2D inpainting experiment

The image shown in Figure (1) was used for the 2D inpainting experiment. A region from the original image was manually cropped; the resulting image is shown in Figure (2). The output of 2D inpainting is shown in Figure (3). We notice that linear structure propagation results in very good recovery of the original image.

Figure (4) shows another input image for the 2D inpainting algorithm, and the corresponding output is shown in Figure (5). This particular result shows that the absence of strong linear edges and the lack of pure texture cause the 2D inpainting algorithm to fail. This difficulty is somewhat alleviated in the case of video, when information that is missing in one frame is present in other frames.


Figure 2: Input image for 2D inpainting: algorithm works well on it

Figure 3: Output image from 2D inpainting: algorithm has worked well

Figure (6) shows one frame of the input video for the 3D inpainting algorithm, and the corresponding output frame is shown in Figure (7). Neighboring frames didn't have missing pixels, and hence recovering the missing data was relatively easy.

More results are available in the form of video (.avi) files.

5. Discussion and Future Work

Unlike image restoration and other applications of image inpainting, automatic patch size selection is crucial for video applications. The dynamic nature of scene content precludes the use of a fixed patch size. For this purpose, automatic scale selection methods from scale-space theory might be useful.

For a practical video codec, inpainting alone may not be sufficient, as it assumes no data is available in the region to be filled, thus making it difficult to have any control over rate-distortion performance. Some hybrid algorithm combining inpainting and restoration should be more useful.

Figure 4: Input image for 2D inpainting: algorithm doesn't work well on it

Figure 5: Output image from 2D inpainting: algorithm hasn't worked well

The non-parametric nature of the present work makes it susceptible to a lack of sufficient data. Even though a single frame of high resolution video might contain millions of pixels, for the 243-dimensional space of 9 × 9 × 3 patches a million pixels is a very small sample, and the asymptotic performance guarantees of non-parametric methods of probability density estimation do not apply. This can be alleviated to some extent by using semi-parametric approaches such as video epitomes (http://www.psi.toronto.edu/computerVision.html).

Figure 6: Input image for 3D inpainting: adjacent frames don't have missing pixels

Figure 7: Output image from 3D inpainting: presence of relevant information in temporally adjacent frames has helped the algorithm recover the missing data

References

[1] A. Efros and T. K. Leung, "Texture synthesis by non-parametric sampling", in Proc. Seventh ICCV, pp. 1033-1038, 1999.

[2] David J. Heeger and James R. Bergen, "Pyramid-based texture analysis/synthesis", in Proc. ACM SIGGRAPH 1995, pp. 229-238, 1995.

[3] Antonio Criminisi, Patrick Perez and Kentaro Toyama, "Region Filling and Object Removal by Exemplar-Based Image Inpainting", IEEE Transactions on Image Processing, Vol. 13, No. 9, Sep. 2004.

[4] J. Jia and C.-K. Tang, "Image repairing: Robust image synthesis by adaptive ND tensor voting", in Proc. Conf. Computer Vision and Pattern Recognition, Madison, WI, 2003.

[5] M. Bertalmio, G. Sapiro, V. Caselles and C. Ballester, "Image inpainting", in Proc. ACM SIGGRAPH, pp. 417-424, July 2000, http://mountains.ece.umn.edu/ guille/inpainting.htm.

[6] R. Bornard, E. Lecan, L. Laborelli and J.-H. Chenot, "Missing Data Correction in Still Images and Image Sequences", in Proc. ACM International Conference on Multimedia, pp. 355-361, Dec. 2002.

[7] V. Kwatra, A. Schodl, I. Essa, G. Turk and A. Bobick, "Graphcut Textures: Image and Video Synthesis Using Graph Cuts", in Proc. SIGGRAPH 2003.

[8] S. L. Kilthau, M. Drew and T. Moller, "Full search content independent block matching based on the fast Fourier transform", in Proc. ICIP 2002, pp. I-669-672.

[9] T. F. Chan and J. Shen, "Variational Image Inpainting", ftp://ftp.math.ucla.edu/pub/camreport/cam02-63.pdf.

[10] F. C. Crow, "Summed-area tables for texture mapping", in Proc. 11th Annual Conference on Computer Graphics and Interactive Techniques, pp. 207-212.
