
Human-Assisted Motion Annotation

Ce Liu    William T. Freeman    Edward H. Adelson
Massachusetts Institute of Technology

Yair Weiss
The Hebrew University of Jerusalem

Motivations

• Existing motion databases are either synthetic or limited to indoor, experimental setups [1]. Can we obtain ground-truth motion for arbitrary, real-world videos?

• Humans are experts at segmenting moving objects and at perceiving the differences between two frames. Can we build a computer vision system that quantifies human perception of motion and generates ground truth for motion analysis?

• Several issues need to be addressed:

1. Is human labeling reliable (compared to the veridical ground-truth) and consistent (across subjects)?

2. How can we efficiently label every pixel in every frame for hundreds of real-world videos?

Our work

• We designed a human-in-the-loop system to annotate motion in real-world videos [2]:

Semiautomatic layer segmentation: The user labels contours using polygons, and the system automatically propagates the contours to other frames. The system can also propagate the user's corrections across frames.

Automatic layer-wise optical flow: The system automatically computes dense optical flow fields for every layer at every frame using user-specified parameters. For each layer, the user picks the flow that yields the correct matching and agrees with the smoothness and discontinuities of the image.

Semiautomatic motion labeling: When flow estimation fails, the user can label sparse correspondences between two frames, and the system automatically interpolates them into a dense flow field (a simple interpolation sketch appears after this list).

Automatic full-frame motion composition.

• We validate our methodology by comparing against veridical ground-truth data and through user studies.

• We created a ground-truth motion database consisting of 10 real-world video sequences (still growing). This database can be used for evaluating motion analysis algorithms as well as other vision and graphics applications.
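As a concrete, deliberately simplified picture of the sparse-to-dense interpolation step mentioned in the list above, the C++ sketch below densifies user-labeled correspondences with plain inverse-distance weighting. It is our own illustration, not the system's actual interpolator (which minimizes a regularized, robust objective); the names SparseMatch, FlowField, and interpolateFlow are hypothetical.

```cpp
#include <vector>

// A user-labeled correspondence: the pixel at (x, y) moves by (u, v).
struct SparseMatch { float x, y, u, v; };

// Dense flow field stored row-major as separate u and v channels.
struct FlowField {
    int width, height;
    std::vector<float> u, v;
    FlowField(int w, int h) : width(w), height(h), u(w * h, 0.f), v(w * h, 0.f) {}
};

// Hypothetical sketch: spread sparse matches to every pixel with inverse-distance
// weighting.  The real system instead minimizes a robust objective, but the
// intuition is the same: nearby labeled matches dominate the estimate.
FlowField interpolateFlow(const std::vector<SparseMatch>& matches, int w, int h) {
    FlowField flow(w, h);
    if (matches.empty()) return flow;       // nothing labeled yet: zero flow
    const float eps = 1e-4f;                // avoids division by zero at a labeled pixel
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float wsum = 0.f, usum = 0.f, vsum = 0.f;
            for (const SparseMatch& m : matches) {
                float dx = x - m.x, dy = y - m.y;
                float wgt = 1.f / (dx * dx + dy * dy + eps);
                wsum += wgt;
                usum += wgt * m.u;
                vsum += wgt * m.v;
            }
            flow.u[y * w + x] = usum / wsum;
            flow.v[y * w + x] = vsum / wsum;
        }
    }
    return flow;
}
```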

(a) A selected frame (b) Layer labeling (c) User-annotated motion (d) Ground-truth from [1] (e) Difference between (c) and (d)

Figure 3. For the RubberWhale sequence in [1], we labeled 20 layers in (b) and obtained the annotated motion in (c). The “ground-truth” motion from [1] is shown in (d). The error between (c) and (d) is 3.21° in average angular error (AAE) and 0.104 in average endpoint error (AEP), excluding the outliers (black dots) in (d).
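For reference, the AAE and AEP numbers quoted in the caption follow the standard error definitions of [1]. The helper below is a minimal sketch of how the two metrics can be computed for a pair of flow fields; it is our own code, not part of the annotation system, and flowError is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of the two standard flow-error metrics from [1]: average angular
// error (AAE, in degrees) and average endpoint error (AEP, in pixels).
void flowError(const std::vector<float>& ue, const std::vector<float>& ve,   // estimated flow
               const std::vector<float>& ur, const std::vector<float>& vr,   // reference flow
               double& aae, double& aep) {
    const double kPi = 3.14159265358979323846;
    const size_t n = ue.size();           // assumes all four channels have the same size
    if (n == 0) { aae = aep = 0.0; return; }
    double angleSum = 0.0, endptSum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        // Angular error: angle between the 3-D vectors (u, v, 1).
        double num = ue[i] * ur[i] + ve[i] * vr[i] + 1.0;
        double den = std::sqrt(ue[i] * ue[i] + ve[i] * ve[i] + 1.0) *
                     std::sqrt(ur[i] * ur[i] + vr[i] * vr[i] + 1.0);
        double c = std::max(-1.0, std::min(1.0, num / den));   // clamp for acos
        angleSum += std::acos(c) * 180.0 / kPi;
        // Endpoint error: Euclidean distance between the two flow vectors.
        double du = ue[i] - ur[i], dv = ve[i] - vr[i];
        endptSum += std::sqrt(du * du + dv * dv);
    }
    aae = angleSum / n;
    aep = endptSum / n;
}
```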


Figure 1. The graphical user interface (GUI) of our system: (a) main window for labeling contours and feature points; (b) depth controller to change depth value; (c) magnifier; (d) optical flow viewer; (e) control panel.

Figure 5. Some frames of the ground-truth motion database we created. We obtained ground-truth flow fields that are consistent with object boundaries, as shown in columns (3) and (4). For comparison, the output of an optical flow algorithm [3] is shown in column (5). As Table 1 shows, the performance of this algorithm on our database is worse than its performance on the Yosemite sequence (1.723° AAE, 0.071 AEP).

References

[1] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. In Proc. ICCV, 2007.

[2] C. Liu, W. T. Freeman, E. H. Adelson, and Y. Weiss. Human-assisted motion annotation. Submitted to CVPR 2008.

[3] A. Bruhn, J. Weickert, and C. Schnörr. Lucas/Kanade meets Horn/Schunck: combining local and global optic flow methods. IJCV, 61(3):211–231, 2005.

        (a)      (b)       (c)      (d)      (e)      (f)      (g)      (h)
AAE     8.996°   58.905°   2.573°   5.313°   1.924°   5.689°   5.243°   13.306°
AEP     0.976    4.181     0.456    0.346    0.085    0.196    0.385    1.567

Figure 4. The marginal ((a)–(h)) and joint ((i)–(n)) statistics of the ground-truth motion in the database we created (log histograms). Symbols u and v denote horizontal and vertical motion, respectively. From these statistics it is evident that horizontal motion dominates vertical motion; vertical motion is sparser than horizontal motion; flow fields are sparser than natural images; and spatial derivatives are sparser than temporal derivatives.
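As an aside, a "log histogram" here simply means the bin counts are plotted on a logarithmic scale, which exposes the heavy tails of the motion statistics. The sketch below (our own helper, not the authors' plotting code; logHistogram is a hypothetical name) shows one way to compute such a histogram for a flow component or its derivative.

```cpp
#include <cmath>
#include <vector>

// Sketch: a "log histogram" of one flow quantity, e.g. the horizontal
// component u or its spatial derivative du/dx.  Counts are taken over
// uniform bins in [lo, hi) and returned on a log scale, which is what
// makes the heavy tails in Figure 4 visible.
std::vector<double> logHistogram(const std::vector<float>& values,
                                 float lo, float hi, int bins) {
    std::vector<double> hist(bins, 0.0);
    const float binWidth = (hi - lo) / bins;        // assumes hi > lo and bins > 0
    for (float v : values) {
        int b = static_cast<int>((v - lo) / binWidth);
        if (b >= 0 && b < bins) hist[b] += 1.0;     // values outside [lo, hi) are ignored
    }
    for (double& h : hist) h = std::log(h + 1.0);   // log scale; +1 keeps empty bins finite
    return hist;
}
```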

Table 1. The performance of an optical flow algorithm [3] on our database (AAE in degrees, AEP in pixels).

Figure 2. The consistency of nine subjects’ annotation. Clockwise from top left: the image frame, mean labeled motion, mean absolute error (red: higher error, white: lower error), and error histogram.

Experiment

• We applied our system to annotate a veridical example from [1] (Figure 3). Our annotation is very close to theirs: 3.21° AAE, 0.104 AEP. The main difference is at the occluding boundaries.

• We tested the consistency of human annotation across nine subjects (Figure 2). The mean error is 0.989° AAE and 0.112 AEP. The error magnitude correlates with the blurriness of the image.

• We created a ground-truth motion database containing 10 real-world videos with 341 frames (Figure 5, Table 1), covering both indoor and outdoor scenes. The statistics of the ground-truth motion are plotted in Figure 4.


Color map for flow visualization

System Features

• We built our system on state-of-the-art computer vision algorithms. Many of the objective functions in contour tracking, flow estimation, and flow interpolation use L1 norms for robustness. Techniques such as iteratively reweighted least squares (IRLS), pyramid-based coarse-to-fine search, and occlusion/outlier detection are used extensively to optimize these nonlinear objective functions (a toy IRLS sketch appears at the end of this poster).

• The system was written in C++, and Qt 4.3 was used for the GUI (Figure 1). The system has all the components needed to make annotation simple and easy, while also giving the user full freedom to label motion manually.
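As promised above, here is a toy IRLS example. It minimizes a Charbonnier (smoothed L1) penalty for a one-dimensional estimation problem of our own choosing; the system applies the same reweighting idea inside far larger contour-tracking and flow objectives, so this is only an illustration of the mechanics, not the system's actual solver.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Toy IRLS with a Charbonnier (smoothed L1) penalty: estimate a single value m
// minimizing  sum_i sqrt((x_i - m)^2 + eps^2).  Assumes a non-empty input.
double robustMeanIRLS(const std::vector<double>& x, int iters = 30, double eps = 1e-3) {
    double m = 0.0;
    for (double v : x) m += v;
    m /= static_cast<double>(x.size());    // least-squares initialization (plain mean)
    for (int it = 0; it < iters; ++it) {
        double wsum = 0.0, wxsum = 0.0;
        for (double v : x) {
            double r = v - m;
            double w = 1.0 / std::sqrt(r * r + eps * eps);   // IRLS weight for the L1-like penalty
            wsum += w;
            wxsum += w * v;
        }
        m = wxsum / wsum;                   // weighted least-squares update
    }
    return m;
}

int main() {
    // The outlier (100.0) barely moves the robust estimate, unlike the plain mean (~20.8).
    std::vector<double> data = {1.0, 1.1, 0.9, 1.05, 100.0};
    std::printf("robust estimate: %f\n", robustMeanIRLS(data));
    return 0;
}
```

Each IRLS iteration turns the robust problem into a weighted least-squares problem, which is why the same trick scales to the large nonlinear objectives mentioned in the System Features section.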