
2D Video Editing for 3D Effects

Darko Pavić, Volker Schoenefeld, Lars Krecklau, Martin Habbecke, Leif Kobbelt

Computer Graphics Group, RWTH Aachen University
Email: {pavic,kobbelt}@cs.rwth-aachen.de

Abstract

We present a semi-interactive system for advanced video processing and editing. The basic idea is to partially recover planar regions in object space and to exploit this minimal pseudo-3D information in order to make perspectively correct modifications. Typical operations are to increase the quality of a low-resolution video by overlaying high-resolution photos of the same approximately planar object, or to add or remove objects by copying them from other video streams and distorting them perspectively according to some planar reference geometry. The necessary user interaction is entirely in 2D and easy to perform even for untrained users. The key to our video processing functionality is a very robust and mostly automatic algorithm for the perspective registration of video frames and photos, which can be used as a very effective video stabilization tool even in the presence of fast and blurred motion. Explicit 3D reconstruction is thus avoided and replaced by image and video rectification. The technique is based on state-of-the-art feature tracking and homography matching. In complicated and ambiguous scenes, user interaction as simple as 2D brush strokes can be used to support the registration. In the stabilized video, the reference plane appears frozen, which simplifies segmentation and matte extraction. We demonstrate our system for a number of quite challenging application scenarios such as video enhancement, background replacement, foreground removal and perspectively correct video cut and paste.

1 Introduction

Thanks to advances in technology, today's digital still and video cameras are becoming more and more affordable. The amount of digital image data created and stored on storage devices all around the world is immense. Google (www.google.com), Flickr (www.flickr.com) and YouTube (www.youtube.com) are only a few examples of the many possibilities to obtain this data over the internet. The increasing amount of image data induces a demand for appropriate image and video processing methods, where one is especially interested in improving the image data quality or changing the content.

The fact that, on the one hand, low-cost digital still cameras can produce high-quality, high-resolution images and, on the other hand, almost every mobile phone can capture videos, which in general have poor quality, inspired us to develop the system presented in this paper. When using a video camera in everyday life, e.g. during a holiday for taking short memos of the sights, one often finds that the results are unsatisfactory for mainly two reasons: First, the videos are in general shaky, because holding a video camera still is difficult even for a professional without appropriate equipment; second, videos taken with an amateur camera have in general a much lower resolution than still images and consequently much lower quality. In this paper we present a system which is able to stabilize even very strong movements of the camera and thus increase the quality of the video data. The main idea we propose here is simple: in the given input video we track planar regions through the video and generate appropriate homography mappings between the frames (note that here we apply standard, state-of-the-art techniques [HZ03]). Then we are able to stabilize the video by choosing one of the frames as the reference frame and rectifying all other frames to it by using the corresponding homography mappings. The whole process is semi-automatic. First, we extract the image features and cluster them using a RANSAC approach [FB81], where each cluster represents an approximately planar region in the scene.



Figure 1: Our video editing system is built upon a pure 2D interface and is based on standard homography matching. Each tracked region can be easily modified in one frame, and this new information, e.g. here a painting, can automatically be propagated through the whole video, thus creating the desired 3D effect.

User interaction, if needed at all, consists of painting a few brush strokes in order to mark the planar regions of interest and thus provide some additional "semantic" information to the system. Our registration method provides a per-pixel correspondence of the tracked planar regions. This means that in the rectified stream of images we can assume that all pixels having the same spatial position contain approximately the same image information. The degree to which this assumption holds depends on the actual planarity of the region. For relief structures such as facades it usually still works sufficiently well.

Depending on the application scenario, additional still images or videos can be registered to the current data using the same idea. The frozen video parts can then be used as canvas regions for changing the video content. We can use a high-resolution image in order to improve the quality of a low-resolution and low-quality video, or apply image completion techniques and propagate the result through the stream and thus perform video completion. Furthermore, we can simply paste new content like still images or videos.

Our main contribution is the development of a unifying framework for several different video editing tasks. We apply existing feature correspondence and image matching techniques in order to robustly perform video stabilization and registration. The main editing metaphor of our system is based on the perspectively correct transformation of image regions using homographies. This enables even complex editing operations in a simple 2D interface, which otherwise would require difficult user interactions or even tedious frame-by-frame operations. The flexibility of our framework allows us to easily perform video-to-image, image-to-video and video-to-video editing operations, all using the same intuitive user interactions. By this we can create apparent 3D effects without having to generate a 3D reconstruction of the scene.

2 Related work

Our matching and tracking procedure is based on the SIFT features introduced by David Lowe [Low04]. SIFT features have properties that allow for very robust and reliable matching: the descriptors are high-dimensional and hence easy to distinguish, and they are invariant to scale, rotation and translation. The quality of SIFT features was evaluated, e.g., by Mikolajczyk and Schmid [MS05]. Many recent approaches use SIFT features for matching: Brown and Lowe [BL03] apply SIFT-based homography matching in their method for generating panoramas, Snavely et al. [SSS06] used SIFT features in the context of an interactive image browsing system, and the results from that paper were also used by Agarwala et al. [AAC∗06] for creating multi-viewpoint panoramas.

Aligning images is a well-investigated problem in computer vision, and there is a large number of methods based on standard approaches like optical flow [BB95] or stereo matching [SSZ01]. In our image-alignment-based registration we use a variation of the Lucas-Kanade algorithm [BM04]. Aligning images from two input videos with similar camera trajectories capturing a similar scene frame-by-frame leads to the Video Matching approach [ST04], which is related to our method. However, our approach is conceptually different, since we apply planar matching and the input videos do not have to show the same scene.

Video editing in combination with computer vision techniques, as done in our paper, was also applied in other contexts for visualization purposes [BBS∗08, WC07].

Image completion has been explored extensively in many research papers in recent years [SYJS05, DCOY03, CPT03]. Pavic et al. [PSK06] have introduced a technique for embedding perspective correction in the image completion process by piecewise linear approximation of the underlying scene. Their image completion can be done using several registered input images, a process called multiview image completion. We adopt the rectification idea from this paper and extend it to videos, applying it not only for completion but also for other application scenarios.

In general, applying image completion methods in the context of video completion will not work properly, because one has to take care of temporal coherence. Wexler et al. [WSI04] describe video completion as a global optimization problem. Jia et al. [JHM05] use tracking of moving objects in order to improve a fragment-based completion process. Our registration procedure aligns the video data so that temporally neighboring pixels describe approximately the same part of the scene, and so we can reduce the video completion problem to an image completion problem, at least for the approximately planar part of the scene used for the matching.

There are several methods dealing with the problem of cutting out objects from a video, which is closely related to the general problem of foreground-background segmentation or simply the matting problem [CCSS01, SJTS04, LLW06]. Doing video cut and paste [WXSC04, WBC∗05, LSS05] requires static cameras or stabilized video data in general. This is exactly the setting we are creating with our method, hence enabling the application of these methods to a wider range of videos.

The operation of overlaying a low-resolution video with a high-resolution image can be understood as texture replacement on 3D objects. Texture replacement in photos has been addressed, e.g., by Liu et al. [LLH04]. Recently, Bhat et al. [BZS∗07] have proposed a method for video enhancement by transferring high-resolution photos to low-resolution videos. They compute a 3D reconstruction of the scene using a multiview-stereo algorithm. Van den Hengel et al. [vdHDT∗07] propose a method for interactive 3D reconstruction of objects from video. In contrast to these two methods, our approach is purely 2D and thus completely avoids 3D reconstruction, which makes it easier to handle low-quality footage and ambiguous configurations.

There is a number of approaches dealing with video enhancement. Image deblurring is used in order to increase the quality of an image. In the context of super-resolution, a sequence of low-resolution images is used in order to create one high-resolution still image [BBZ96]. Matsushita et al. introduced a method for deblurring a video by copying image details between neighboring frames [MOTS05]. In this work they also addressed the stabilization problem, but their approach smoothes the camera movement, whereas our approach completely freezes it. Irani and Anandan [IA98] apply modifications to the mosaic representation of the video and propagate them through the sequence. Our method does not use mosaic representations but only one registered frame, where the frozen plane is used as an ordinary canvas.

A proper comparison of our system with commercially available tools is not possible due to budget issues and the lack of internal information. To the best of our knowledge, most of these tools (e.g., Maya Live or Boujou) apply some kind of 3D reconstruction, which is the main difference to our system. The tool most similar to ours seems to be the Monet tool (www.imagineersystems.com). There, just like in our case, some sort of homography matching is applied, but it is exclusively based on tracking features. As a consequence it does not work well in homogeneous regions, and it has to rely on user input to constrain the possible motions (e.g. maximum translation and rotation). Our system, in addition, allows for image-based homography matching (Section 4.3), which does not require the identification of features. The image-based technique proved to be more stable and more reliable, especially when there are not many features to be tracked. No additional user input is required to estimate the maximum shift and rotation. With perspectively correct object cut and paste between videos we introduce a versatile 3D video effect which is easy to achieve with our 2D system and would be much harder with all the above-mentioned tools.

3 System Overview

In order to achieve a realistic result when pasting a snippet from one still image into another, it is necessary to apply a perspective correction. In general this is not possible, since a single image does not contain the required 3D information for this operation. However, if the object to be copied is approximately planar, then the perspective correction is defined by a 2D homography, which can be computed from 8 constraints, i.e. from 4 pairs of feature points in the respective images [HZ03].


Figure 2: System overview. In the most general setting, two video streams are perspectively registered to their respective reference frames. A homography M between the two reference frames provides the necessary information to copy image content from frame Ei in one stream to frame Fj in the other stream. The perspectively correct distortion is computed by concatenation of the corresponding homographies: (H'j)^-1 · M · Hi.

Even if the object to be copied is not exactly planar, the result will still look acceptable if the deviation from a reference plane is not too strong.
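To make this concrete, here is a minimal sketch of such a perspectively corrected paste using OpenCV; the function name and the quad-based interface are illustrative assumptions, not part of the system described in this paper.

```python
import cv2
import numpy as np

def paste_planar_snippet(target, snippet, src_quad, dst_quad):
    """Paste 'snippet' into 'target' so that the quad src_quad (in the
    snippet) lands on dst_quad (in the target), perspectively corrected.
    src_quad, dst_quad: 4x2 arrays of corresponding corner points."""
    # Four point pairs provide the 8 constraints that determine the homography.
    H = cv2.getPerspectiveTransform(np.float32(src_quad), np.float32(dst_quad))
    h, w = target.shape[:2]
    warped = cv2.warpPerspective(snippet, H, (w, h))
    # Warp a white mask the same way to know which target pixels to overwrite.
    mask = cv2.warpPerspective(np.full(snippet.shape[:2], 255, np.uint8), H, (w, h))
    out = target.copy()
    out[mask > 0] = warped[mask > 0]
    return out
```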

Now this idea can be taken one step further and applied to video streams. Again, to copy an object from one video into another, we have to apply a perspective correction. This correction can be approximated by a 2D homography if the object is sufficiently planar or the difference in orientation between source and target perspective is not too large. However, the major difference between the two scenarios is that in the case of a video stream, the relative position of the camera with respect to a reference plane in object space can change.

Hence our system is based on a robust technique which allows us to track an approximately planar region through the frames of a video. From the tracking information, we compute homographies that perspectively align (register) each frame of the video stream to a reference frame. In this perspectively registered video stream, the planar region appears "frozen", so that we have, in effect, brought the video setting back to the setting of still images.

In the most general application scenario, we copy an object from one video with moving camera (or moving reference plane) into another video with moving camera. Here the transformations go from the original frames to the reference frames in the corresponding streams. Then the actual copy operation is applied from the reference frame of one video stream into the reference frame of the other video stream (see Fig. 2). Notice that due to the planarity assumption all operations can be done in 2D, so there is no need for any explicit reconstruction of 3D information in the scene. This makes the user interaction easier and the computation more robust (even for small baselines).
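In matrix terms, the copy operation is just a concatenation of 3x3 homographies. A minimal sketch, with the names H_i, M and Hp_j following Fig. 2 and the helper itself being hypothetical:

```python
import numpy as np
import cv2

def source_to_target_map(H_i, M, Hp_j):
    """Map pixels of source frame Ei into target frame Fj: H_i brings the
    frame into the source reference frame, M crosses over to the target
    reference frame, and the inverse of H'_j leaves it again."""
    return np.linalg.inv(Hp_j) @ M @ H_i

# Usage sketch: warp the tracked region of source frame i into target frame j.
# T = source_to_target_map(H[i], M, Hp[j])
# warped = cv2.warpPerspective(source_frames[i], T, (target_w, target_h))
```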

There are various special cases of this general scenario that also lead to very useful video processing operations. For example, if just one video stream is used, the perspective registration can serve as a video stabilization tool for shaky footage taken with a hand-held camera. If we have one video stream plus one photograph, we can copy (parts of) the photo as a texture onto a moving object in the video. Since still photos usually have a much higher quality than videos, this operation can be used to effectively improve the visual quality. If multiple planar regions are to be processed in the same scene (video), multiple rectified video streams are generated independently, and for each editing operation the corresponding homographies are used. Various applications will be demonstrated in the results section (Section 6).

3.1 User Interface

The user interface in our system is simple and intuitive. The editing operations are broken down into their atomic components, which are represented as nodes of a data flow graph. For a specific video editing task the user creates such a graph by adding the needed nodes per drag-and-drop and connecting them as needed. A simple example of such a graph is shown in Fig. 3.

Each node has a number of possible inputs (left) and outputs (right), which are depicted by the big (red or green) dots. By connecting the different nodes, the user intuitively defines the flow of information in the graph. Green connections are valid, whereas red connections indicate that there is no data available. By clicking on one of the nodes in the graph, the corresponding application unit is opened; e.g., clicking on a "Tracking" node shows the tracked stream, and the user can mask regions for tracking or define the homography mapping by using the quad metaphor (see Fig. 8).
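A data flow graph of this kind can be modeled with very little machinery. The sketch below is our own illustrative reading of the interface; the slot layout and the ready() check are hypothetical, and only node names such as "VideoResource" and "Tracking" appear in the paper.

```python
class Node:
    """One atomic editing operation with named input and output slots."""
    def __init__(self, name, inputs, outputs):
        self.name = name
        self.inputs = {slot: None for slot in inputs}  # slot -> (node, out_slot)
        self.outputs = list(outputs)

    def ready(self):
        # A connection is "green" (valid) once all inputs have upstream data.
        return all(v is not None for v in self.inputs.values())

def connect(src, out_slot, dst, in_slot):
    dst.inputs[in_slot] = (src, out_slot)

# Wiring sketch for part of the graph in Fig. 3:
video = Node("VideoResource", [], ["stream"])
tracking = Node("Tracking", ["stream"], ["registered", "homographies"])
connect(video, "stream", tracking, "stream")
print(tracking.ready())  # True once the video source is connected
```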

The graph in Fig. 3 corresponds to the video editing example shown in Fig. 1.


Figure 3: Composition graph created with our system for the video editing task presented in Fig. 1.

There are two input nodes: "VideoResource" for the input video and "ImageResource" for the input image. The editing operation we want is to replace the painting in the video with the source image. Therefore the video source is tracked ("Tracking" node) and matted ("VideoMatte" node). Each registered and rectified frame is composed with the input image ("ComposeImage" node) and then projected back to video space ("Invert" node) by using the inverse of the map that was used for rectification. Finally, by overlaying ("Overlay" node) the composed image with the foreground information from the matting node, we get the final result. Please see the accompanying video for a better demonstration of the user interaction with our system.

For non-expert users who do not have the technical knowledge about which transformation to use in order to achieve the desired effect, our system provides an application wizard. The user simply chooses the desired effect (cf. the application scenarios in Section 6), which corresponds to selecting an appropriate graph such as the one in Fig. 3. Then the wizard guides the user through all steps, e.g. loading image or video sources, specifying the tracking regions, specifying the perspectively correct transformations by using a quad metaphor, etc.

The main functional units in our system are: the registration unit, which does the tracking, i.e. the perspective registration; the matting unit, which separates foreground and background; and the compositing unit, which actually transfers image information from one video stream into the other.

The registration unit will be described in detail in the next section. It takes the raw video streams as input and generates a set of 2D homographies which distort each frame in the video such that all pixels whose pre-image lies in the reference plane do not move. Hence, within the registered region, we have trivial pixel correspondences throughout the video stream, which makes segmentation as well as compositing very easy. For the initial segmentation we use our interactive background technique, which is described in Section 5. For creating high-quality mattes we use state-of-the-art matting techniques [WBC∗05, LSS05]. The compositing is done by simply over-painting the corresponding image region, possibly followed by a simple luminance correction.

4 Registration

The two major objectives of our registration procedure are maximum robustness and minimum user interaction. Hence we use a combination of SIFT feature tracking, RANSAC, and homography matching [BL03]. The input consists of a sequence of video frames F0, . . . , Fn. After the registration we have a sequence of 2D homographies H1, . . . , Hn such that in the distorted images Hi(Fi) all pixels whose pre-image lies in the reference plane match the corresponding pixel in the reference frame F0. The quality of the registration procedure is visualized in Fig. 4.

4.1 Local registration

Given two frames Fi and Fi+1, we compute the SIFT features [Low04] in both images.


Figure 4: Rectification. By taking slices through a video volume (x-axis: 320 pixels, y-axis: 240 pixels, time axis: 385 frames) we see that the noisy raw video data as seen in (a) and (c) is well aligned after the registration process (b) and (d).

Since these features come with a signature, for every feature point pj in Fi we can find the corresponding feature point qj in Fi+1 which has the most similar signature. The feature pair (pj, qj) is discarded if there is another SIFT feature nearby, which makes the match unreliable. For symmetry reasons, we search for additional feature pairs by exchanging the roles of Fi and Fi+1. After this first step, we have a set of candidate feature pairs (p0, q0), . . . , (pk, qk), with k typically in the range of a few hundred.
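A sketch of such a symmetric matching step with OpenCV's SIFT implementation follows. Here the paper's "another feature nearby" rejection is approximated by Lowe's ratio test in descriptor space, which is an assumption on our part:

```python
import cv2

def candidate_pairs(img_a, img_b, ratio=0.8):
    """Symmetric SIFT matching between two frames: keep a pair only if the
    best match is clearly better than the second best (unambiguous) and if
    it survives the reverse-direction check."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    bf = cv2.BFMatcher(cv2.NORM_L2)

    def one_way(d1, d2):
        good = {}
        for m, n in bf.knnMatch(d1, d2, k=2):
            if m.distance < ratio * n.distance:  # reject ambiguous matches
                good[m.queryIdx] = m.trainIdx
        return good

    ab, ba = one_way(des_a, des_b), one_way(des_b, des_a)
    # Keep only pairs found in both directions (symmetry check).
    return [(kp_a[i].pt, kp_b[j].pt) for i, j in ab.items() if ba.get(j) == i]
```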

In the second step we apply a RANSAC procedure [FB81] in order to find an initial 2D homography Gi between Fi and Fi+1. For this we pick a random set of 4 feature pairs to provide the 8 constraints that we need. If we write Gi in homogeneous coordinates as a 3 × 3 matrix, we can normalize one entry to unity and solve an 8 × 8 system for the unknown matrix entries.
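A direct transcription of this 8 × 8 solve could look as follows (a sketch; the paper gives no code), with the lower-right entry of the matrix normalized to 1:

```python
import numpy as np

def homography_from_4_pairs(pairs):
    """Solve the 8x8 linear system for a homography whose lower-right
    entry is fixed to 1. pairs: four ((x, y), (x', y')) correspondences."""
    A, b = [], []
    for (x, y), (xp, yp) in pairs:
        # x' = (g0*x + g1*y + g2) / (g6*x + g7*y + 1), analogously for y'.
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp]); b.append(xp)
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp]); b.append(yp)
    g = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(g, 1.0).reshape(3, 3)
```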

We check the matching quality of Gi by computing the matching error ej = ‖Gi(pj) − qj‖ for all candidate feature pairs and taking the median of these values [Ste99, CSR99]. By this definition we implicitly assume that at least half of the features belong to the planar region to be tracked. If this assumption does not hold, the user can constrain the feature matching by roughly sketching a region in the image where the planar region is visible in at least 50% of the pixels.

After O(k) RANSAC tests, we pick the homography Gi with the best matching quality, i.e. with the smallest median matching error. In order to make the computation more robust, we collect the set of all feature pairs for which the matching error with respect to Gi lies below half a pixel. Then we compute the final homography Gi by least-squares fitting to this over-constrained problem.
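Putting the pieces together, a sketch of this RANSAC variant with the median error criterion and the final least-squares refit (reusing homography_from_4_pairs from above; the trial count is an assumed parameter):

```python
import numpy as np
import random

def project(G, p):
    q = G @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

def ransac_homography(pairs, trials=500):
    """RANSAC with the median reprojection error as quality measure (valid
    if at least half the pairs lie on the tracked plane), followed by a
    least-squares fit to all pairs with sub-half-pixel error."""
    best, best_med = None, np.inf
    for _ in range(trials):
        try:
            G = homography_from_4_pairs(random.sample(pairs, 4))
        except np.linalg.LinAlgError:
            continue  # degenerate sample, e.g. collinear points
        med = np.median([np.linalg.norm(project(G, p) - q) for p, q in pairs])
        if med < best_med:
            best, best_med = G, med
    inliers = [(p, q) for p, q in pairs
               if np.linalg.norm(project(best, p) - q) < 0.5]
    # Over-constrained refit: two DLT rows per inlier, solved by lstsq.
    A, b = [], []
    for (x, y), (xp, yp) in inliers:
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp]); b.append(xp)
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp]); b.append(yp)
    g = np.linalg.lstsq(np.array(A, float), np.array(b, float), rcond=None)[0]
    return np.append(g, 1.0).reshape(3, 3)
```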

The RANSAC procedure has led to very good results in all our experiments. There are two potential reasons why a tentative homography Gi can have a bad matching quality: the 4 random feature pairs either do not lie in a plane or they do not lie in the correct plane. Consequently, a good matching quality indicates that the 4 random features are lying in the right plane, and hence the RANSAC selection criterion is correct.

4.2 Global registration

If we compute only frame-to-frame homographies Gi, then matching errors accumulate, which leads to a clearly visible drift. In order to make the registration procedure even more robust, we have to directly register each frame Fi to the reference frame F0.

Matching F0 and F1 is simply done by the local procedure of the last section. Now assume that we have already globally registered the frames F1, . . . , Fk−1, i.e. we have homographies Hi mapping Fi back to F0, for i < k. Our goal is to use as much feature information as possible to compute Hk.

First, we find feature pairs (pj, qj) between F0 and Fk directly, by using the same technique as in the local registration. For longer video sequences these will be only very few, if any. For the left-over features qj in Fk we then try to construct new partners pj in F0. Since the mapping H1 for F1 is most likely the most accurate match, we begin by checking whether we can find feature pairs between F1 and Fk. For each such pair (p'j, qj), we add (H1(p'j), qj) to the list of candidate pairs between F0 and Fk. We continue this procedure with the remaining intermediate frames F2, . . . , Fk−1. Eventually, we have a large set of feature pairs between F0 and Fk to which we can apply the local registration procedure.
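As a sketch, the candidate collection for one frame Fk could be organized as follows; 'match' and the data layout are hypothetical, while 'project' and the final homography fit come from the sketches above:

```python
def global_pairs(k, match, H):
    """Candidate pairs between F0 and Fk: direct matches first, then
    matches against already registered frames F1..Fk-1, whose points are
    transported into F0 through the known homographies H[1..k-1].
    match(i, j) is assumed to return the local feature pairs between
    frames i and j; H[i] maps frame i onto the reference frame F0."""
    pairs = list(match(0, k))                 # direct F0 <-> Fk matches
    taken = {tuple(q) for _, q in pairs}
    for i in range(1, k):                     # H[1] first: most accurate
        for p, q in match(i, k):
            if tuple(q) not in taken:         # only left-over features of Fk
                pairs.append((tuple(project(H[i], p)), q))
                taken.add(tuple(q))
    return pairs                              # input for ransac_homography
```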

4.3 Image-based homography matching

The feature-based homography matching fails if there are not enough SIFT features in the planar region to be tracked. In those cases we let the user sketch the planar region via a simple 2D GUI. The most intuitive metaphors turned out to be: (1) sketching a polygon (e.g. a quad) or (2) painting an image region with a brush tool. In both cases the user defines a set of pixels Ω that lie in the focus region. This pixel set can then be tracked by a variant of the standard Lucas-Kanade image matching algorithm [BM04], where the non-linear functional

E(Hi) = Σ_{p ∈ Ω} (F0(p) − Fi(Hi(p)))²


is minimized iteratively. This frame-to-frame matching could also be extended and applied in a global, multi-frame manner [ZMI99], but the results we have achieved with the local method above were stable enough in all our experiments.
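OpenCV ships a closely related direct alignment routine (ECC maximization instead of the squared-difference functional above) that can stand in for this step; treating it as a drop-in replacement is our assumption, not the paper's implementation:

```python
import cv2
import numpy as np

def align_frame_to_reference(F0_gray, Fi_gray, omega_mask):
    """Direct image alignment of frame Fi to reference F0 over the
    user-sketched pixel set Omega (omega_mask: uint8 mask). The routine
    iteratively optimizes a full homography H so that Fi(H(p)) matches
    F0(p) on the masked region."""
    H = np.eye(3, 3, dtype=np.float32)  # identity as initial guess
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
    _, H = cv2.findTransformECC(F0_gray, Fi_gray, H, cv2.MOTION_HOMOGRAPHY,
                                criteria, omega_mask, 5)
    # Rectified frame F'_i = Hi(Fi): sample Fi at H(p) for every pixel p.
    h, w = F0_gray.shape
    return cv2.warpPerspective(Fi_gray, H, (w, h),
                               flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
```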

5 Interactively Created Background

In the perspectively registered video frames F'i = Hi(Fi) we have established the property that a pixel p in frame F'i corresponds to the same pixel in any other frame F'j if it belongs to the focus region and if it is not occluded. Under the assumption that every point of the focus region is visible in at least one frame, foreground object removal is straightforward. All we have to do is to find out whether a given pixel p in frame F'i belongs to the focus region. If it does, its color value can be propagated along the time axis in the video volume. By this we remove any occlusion from the focus region.

Instead of implementing a heuristic criterion which classifies pixels as foreground or background, we let the user decide through a simple 2D user interface. The user can browse through the video sequence and paint over visible background regions in any frame. Based on the perspective registration, this information can be propagated into every other frame. Painting over different parts of the background in different frames eventually removes any foreground object occluding the focus region (see the accompanying video and Fig. 5).

By this we generate a background stream for the focus region, which can be used for segmentation by subtracting the background stream from the original stream. Notice that the background stream we produce is more or less just a background picture that deforms over time based on the homographies Hi. If the background is sufficiently planar, this will not generate any visible artifacts. In our implementation we use the background subtraction in combination with morphological dilation and erosion to compute an acceptable foreground matte. If higher quality is required, we apply state-of-the-art video cutout techniques [WBC∗05, LSS05].
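A sketch of this matte extraction on one rectified frame (threshold and kernel size are assumed values):

```python
import cv2
import numpy as np

def foreground_matte(rect_frame, background, thresh=30):
    """Rough binary foreground matte: pixels that differ from the
    interactively created background are foreground; morphological
    opening and closing (erosion/dilation) clean up speckle."""
    diff = cv2.absdiff(rect_frame, background)
    if diff.ndim == 3:
        diff = diff.max(axis=2)               # strongest channel difference
    matte = np.where(diff > thresh, 255, 0).astype(np.uint8)
    kernel = np.ones((5, 5), np.uint8)
    matte = cv2.morphologyEx(matte, cv2.MORPH_OPEN, kernel)
    matte = cv2.morphologyEx(matte, cv2.MORPH_CLOSE, kernel)
    return matte
```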

6 Application Scenarios and Results

In the following we will show how our method can be used in different application settings.

Video stabilization: After perspective registration, the video appears very much stabilized (Fig. 4).

Figure 5: Interactively created background segmentation. In our registered setting the user can create a reference background image (c) by painting visible background information with a brush in several rectified frames if necessary (a, b).

Even though only the planar focus region that we have tracked appears completely frozen, and the rest is distorted according to its deviation from the reference plane, the stabilization effect is surprisingly good in most experiments. In the accompanying video we show a very shaky example video where one can hardly identify the content of the input sequence.

Object texturing: The registration of the video stream allows us to use the rectified planar focus region, taken from one of the frames, as a canvas for drawing or for pasting textures onto the corresponding objects (see Fig. 1). The texture can be propagated to the other frames of the video by using the 2D homographies. This creates a perspectively correct mapping of the texture onto the moving object. During texture propagation we apply a simple luminance correction based on the average luminance of the focus region in order to create a visually plausible result. In Fig. 9 we show an example where several planar regions are (video-)textured simultaneously.

Figure 9: Multiple planes. In this example we have used the three dominant planes (the left wall, the right wall and the floor) as canvas surfaces. On the left wall we have mapped the facade from another source video, on the right wall the "Grafiti" image is pasted, and on the floor another static video stream is synthesized. Notice the moving shadow of the tree. From left to right we see three example frames of the original video and the corresponding rectified versions (top) and the final composition (bottom).

High-resolution video: As a special case of object texturing, if we have a low-resolution video as input, we can register a high-resolution image to the video and copy pixels from the image to the video. This process simulates texture mapping on the approximately planar focus region, however without actual 3D reconstruction. The user only has to paint the region in the low-resolution reference frame of the video; the system then exploits the registration information and propagates it to the other frames (e.g. the facade in Fig. 6). The generation of such a mask or matte is described in Section 5.


Figure 6: High-resolution video. The upper row shows the low-quality input video from a mobile phone (320x240). The lower row shows the result video (960x720) of our system after threefold magnification, pasting in the information from a high-resolution image (2592x1944). Notice that the statue in the foreground is still low-resolution.
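A sketch of the texture propagation with luminance correction described above, under the convention that H[i] maps frame i onto the reference frame; all names are illustrative:

```python
import cv2
import numpy as np

def propagate_texture(frames, H, canvas, region_mask):
    """Paste 'canvas' (the edited rectified reference frame) into every
    frame through the inverse registration homography, scaling its
    brightness to the mean luminance of the focus region per frame."""
    ref_lum = cv2.cvtColor(canvas, cv2.COLOR_BGR2GRAY)[region_mask > 0].mean()
    results = []
    for frame, Hi in zip(frames, H):
        h, w = frame.shape[:2]
        Hi_inv = np.linalg.inv(Hi)            # reference space -> frame space
        warped = cv2.warpPerspective(canvas, Hi_inv, (w, h))
        mask = cv2.warpPerspective(region_mask, Hi_inv, (w, h))
        # Simple luminance correction: match the focus region's mean gray value.
        cur_lum = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)[mask > 0].mean()
        warped = np.clip(warped * (cur_lum / ref_lum), 0, 255).astype(np.uint8)
        out = frame.copy()
        out[mask > 0] = warped[mask > 0]
        results.append(out)
    return results
```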

Background replacement: In Fig. 7 an example for the replacement of the outdoor scenery in the input stream is shown. In the first step the given input stream is perspectively registered. Rectifying each frame, i.e. mapping the windows to a quad, makes it easy to mask out the outdoor scene seen through the window. We apply a second registration pass only to the outdoor scenery. To avoid unwanted distortions we restrict the 2D homographies to their translational, rotational, and scaling components. This corresponds to treating the outside scenery as a plane at infinite distance. The mappings generated this way between the frames are simply applied to another, stabilized or static, outdoor video. We replace the outdoor part in the rectified space and map the stream back to the original camera space. This results in a very realistic scenery replacement.

Video completion: Our system reduces the video completion problem to image completion. Once the video is registered, we can apply an image completion algorithm to the planar part of the scene that we have used for matching. In order to handle perspective correction, the image completion can be performed in rectified image space [PSK06].

Figure 7: Background replacement. Top left: the mask used for homography tracking of the indoor part. Top right: masked background used for similarity matching (homography reduced to its translational, rotational and scaling components). The bottom row shows two different new backgrounds.
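One way to implement this reduction of the homography to translation, rotation and uniform scale is to fit a similarity transform directly; OpenCV's estimateAffinePartial2D covers exactly these four degrees of freedom (using it here is our assumption, the paper does not name an implementation):

```python
import cv2
import numpy as np

def restricted_homography(pts_prev, pts_next):
    """Fit only translation + rotation + uniform scale between matched
    points of the outdoor region, i.e. treat it as a plane at infinity."""
    S, _ = cv2.estimateAffinePartial2D(np.float32(pts_prev),
                                       np.float32(pts_next),
                                       method=cv2.RANSAC)
    return np.vstack([S, [0.0, 0.0, 1.0]])  # promote 2x3 to a 3x3 homography
```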

Perspectively correct video cut & paste: By using homography matching between planar parts of two different scenes, we are able to transfer moving characters from one video to another, creating plausible-looking results with perspectively correct movements (see Fig. 8 and the accompanying video). After the registration we define a homography between the reference frames of both videos by using a 2D quad metaphor [PSK06]. Since placing the quads at the correct position in the respective reference frame is difficult, we allow the user to adjust the mapping by drag-and-drop in the original video. Finally, we are able, for example, to cut out the dinosaur from one video and paste it into the other, creating the illusion of a perspectively correct movement parallel to the hedge (Fig. 8).

The example shown in Figs. 4 and 5 is a low-quality video stream (320x240) taken with a low-cost digital camera. All other videos shown here and in the accompanying video were taken with an amateur video camera with progressive scan at 15 fps and 720x576. All streams were corrected for lens distortion before use, since homography tracking is in general very sensitive to this kind of distortion. We have tested our system on an AMD64 3500+ PC with 4GB RAM. The perspective registration procedure never took more than a few seconds per frame. For the high-quality mattes created for the examples in Figures 1 and 8 we used an interface similar to the Video Cutout interface of Wang et al. [WBC∗05].


Figure 8: Perspectively correct object cut & paste. Using the hedge stream and the dino stream as input, we define the connection between them interactively by providing two quads as point correspondences to the system. After cutting out the dino from one stream, we can paste it, perspectively corrected, into the other, registered one. On the right we see four frames from the final composition.

This is a non-trivial task, taking from a few seconds up to a few minutes of interaction time per frame, depending on the matte complexity and the user's precision requirements. Please take a look at the accompanying video to get a better impression of the video quality.

7 Conclusion

In this paper we have presented a unifying framework for pure 2D video processing and editing. The main idea is to take advantage of the perspective registration of the individual frames in the input video, which stabilizes the video with respect to an approximately planar region in the scene. For this we have exploited state-of-the-art tracking techniques. By using the rectified reference frame as a canvas, a number of useful (pseudo-)3D effects can be achieved without the need for any kind of explicit 3D reconstruction. In the registered video volume we are able to interactively extract the background information for matting. Furthermore, video completion, re-texturing, as well as (perspectively correct) video cut & paste can be done based on this registered video stream via a simple 2D user interface. All the operations needed for the presented editing applications are simply nodes in a processing graph used by our system. Creating such a graph is as simple as dragging and dropping the needed nodes and connecting them where needed. The usability of our interface as well as the versatility of the possible editing applications has been demonstrated.

Our registration approach relies on the existence of planar geometry in the scene, i.e. the videos we are able to handle must contain approximately planar regions for tracking. Obviously, the perspectively corrected video cut & paste produces satisfying results only if the pasted object is sufficiently planar or the difference between source and target perspective is not too large. Still, the variety of the examples presented here shows that there are quite a few application scenarios for our system.

References

[AAC∗06] AGARWALA A., AGRAWALA M., COHEN M., SALESIN D., SZELISKI R.: Photographing long scenes with multi-viewpoint panoramas. In ACM SIGGRAPH (New York, NY, USA, 2006), ACM Press, pp. 853–861.

[BB95] BEAUCHEMIN S. S., BARRON J. L.: The computation of optical flow. In ACM Computing Surveys (1995), vol. 27, pp. 433–467.

[BBS∗08] BOTCHEN R. P., BACHTHALER S., SCHICK F., CHEN M., MORI G., WEISKOPF D., ERTL T.: Action-based multifield video visualization. IEEE Transactions on Visualization and Computer Graphics 14, 4 (2008), 885–899.

[BBZ96] BASCLE B., BLAKE A., ZISSERMAN A.: Motion deblurring and super-resolution from an image sequence. In ECCV (1996), pp. 573–582.

[BL03] BROWN M., LOWE D. G.: Recognising panoramas. In ICCV '03: Proceedings of the Ninth IEEE International Conference on Computer Vision (2003), p. 1218.

[BM04] BAKER S., MATTHEWS I.: Lucas-Kanade 20 years on: A unifying framework. IJCV 56, 3 (2004), 221–255.

[BZS∗07] BHAT P., ZITNICK C. L., SNAVELY N., AGARWALA A., AGRAWALA M., CURLESS B., COHEN M., KANG S. B.: Using photographs to enhance videos of a static scene. In Proc. EGSR (2007), pp. 327–338.

[CCSS01] CHUANG Y.-Y., CURLESS B., SALESIN D. H., SZELISKI R.: A Bayesian approach to digital matting. In Proceedings of IEEE CVPR 2001 (December 2001), vol. 2, IEEE Computer Society, pp. 264–271.

[CPT03] CRIMINISI A., PEREZ P., TOYAMA K.: Object removal by exemplar-based inpainting. In CVPR (2) (2003), pp. 721–728.

[CSR99] CAN A., STEWART C. V., ROYSAM B.: Robust hierarchical algorithm for constructing a mosaic from images of the curved human retina. In CVPR (1999), vol. 2, pp. 286–292.

[DCOY03] DRORI I., COHEN-OR D., YESHURUN H.: Fragment-based image completion. ACM Transactions on Graphics 22, 3 (2003), 303–312.

[FB81] FISCHLER M. A., BOLLES R. C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 6 (1981), 381–395.

[HZ03] HARTLEY R., ZISSERMAN A.: Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[IA98] IRANI M., ANANDAN P.: Video indexing based on mosaic representation. Proceedings of the IEEE 86, 5 (1998), 905–921.

[JHM05] JIA Y.-T., HU S.-M., MARTIN R. R.: Video completion using tracking and fragment merging. Visual Computer 21, 8-10 (2005), 601–610.

[LLH04] LIU Y., LIN W.-C., HAYS J.: Near-regular texture analysis and manipulation. ACM SIGGRAPH 23, 3 (2004), 368–376.

[LLW06] LEVIN A., LISCHINSKI D., WEISS Y.: A closed form solution to natural image matting. CVPR (2006), 61–68.

[Low04] LOWE D. G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60, 2 (2004), 91–110.

[LSS05] LI Y., SUN J., SHUM H.-Y.: Video object cut and paste. In SIGGRAPH '05: ACM SIGGRAPH 2005 Papers (New York, NY, USA, 2005), ACM Press, pp. 595–600.

[MOTS05] MATSUSHITA Y., OFEK E., TANG X., SHUM H.-Y.: Full-frame video stabilization. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. (2005), vol. 1, pp. 50–57.

[MS05] MIKOLAJCZYK K., SCHMID C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27, 10 (2005), 1615–1630.

[PSK06] PAVIC D., SCHOENEFELD V., KOBBELT L.: Interactive image completion with perspective correction. Visual Computer 22, 9 (2006), 671–681.

[SJTS04] SUN J., JIA J., TANG C.-K., SHUM H.-Y.: Poisson matting. ACM Trans. Graph. 23, 3 (2004), 315–321.

[SSS06] SNAVELY N., SEITZ S. M., SZELISKI R.: Photo tourism: exploring photo collections in 3D. In ACM SIGGRAPH (New York, NY, USA, 2006), ACM Press, pp. 835–846.

[SSZ01] SCHARSTEIN D., SZELISKI R., ZABIH R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IEEE Workshop on Stereo and Multi-Baseline Vision, 2001.

[ST04] SAND P., TELLER S.: Video matching. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers (New York, NY, USA, 2004), ACM Press, pp. 592–599.

[Ste99] STEWART C. V.: Robust parameter estimation in computer vision. SIAM Review 41, 3 (1999), 513–537.

[SYJS05] SUN J., YUAN L., JIA J., SHUM H.-Y.: Image completion with structure propagation. ACM Trans. Graph. 24, 3 (2005), 861–868.

[vdHDT∗07] VAN DEN HENGEL A., DICK A., THORMAHLEN T., WARD B., TORR P. H. S.: VideoTrace: rapid interactive scene modelling from video. In ACM SIGGRAPH (New York, NY, USA, 2007), ACM Press, p. 86.

[WBC∗05] WANG J., BHAT P., COLBURN R. A., AGRAWALA M., COHEN M. F.: Interactive video cutout. In ACM SIGGRAPH (New York, NY, USA, 2005), ACM Press, pp. 585–594.

[WC07] WANG Y., COELHO E. M.: Contextualized videos: Combining videos with environment models to support situational understanding. IEEE Transactions on Visualization and Computer Graphics 13, 6 (2007), 1568–1575.

[WSI04] WEXLER Y., SHECHTMAN E., IRANI M.: Space-time video completion. CVPR 01 (2004), 120–127.

[WXSC04] WANG J., XU Y., SHUM H.-Y., COHEN M. F.: Video tooning. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers (New York, NY, USA, 2004), ACM Press, pp. 574–583.

[ZMI99] ZELNIK-MANOR L., IRANI M.: Multi-frame alignment of planes. In CVPR (1999), pp. 1151–1156.