Fast Exact Hyper-graph Matching with Dynamic Programming for Spatio-temporal Data

Embed Size (px)

Text of Fast Exact Hyper-graph Matching with Dynamic Programming for Spatio-temporal Data

  • J Math Imaging VisDOI 10.1007/s10851-014-0503-6

    Fast Exact Hyper-graph Matching with Dynamic Programmingfor Spatio-temporal Data

    Oya eliktutan Christian Wolf Blent Sankur Eric Lombardi

    Received: 13 March 2013 / Accepted: 13 February 2014 Springer Science+Business Media New York 2014

    Abstract Graphs and hyper-graphs are frequently used torecognize complex and often non-rigid patterns in computervision, either through graph matching or point-set matchingwith graphs. Most formulations resort to the minimization ofa difficult energy function containing geometric or structuralterms, frequently coupled with data attached terms involvingappearance information. Traditional methods solve the min-imization problem approximately, for instance resorting tospectral techniques. In this paper, we deal with the spatio-temporal data, for a concrete example, human actions invideo sequences. In this context, we first make three realisticassumptions: (i) causality of human movements; (ii) sequen-tial nature of human movements; and (iii) one-to-one map-ping of time instants.We show that, under these assumptions,the correspondence problem can be decomposed into a set ofsubproblems such that each subproblem can be solved recur-sively in terms of the others, and hence an efficient exactmin-imization algorithm can be derived using dynamic program-ming approach. Secondly, we propose a special graphicalstructure which is elongated in time. We argue that, insteadof approximately solving the original problem, a solution canbe obtained by exactly solving an approximated problem. Anexact minimization algorithm is derived for this structure and

    O. eliktutan (B) B. SankurDepartment of Electrical-Electronics Engineering, Bogazii Uni-versity, Istanbul, Turkeye-mail: oya.celiktutan@boun.edu.tr

    B. Sankure-mail: bulent.sankur@boun.edu.tr

    C. Wolf E. LombardiUMR CNRS 5205, INSA-Lyon, LIRIS, Universit de Lyon, CNRS,69621 Lyon, Francee-mail: christian.wolf@liris.cnrs.fr

    E. Lombardie-mail: eric.lombardi@liris.cnrs.fr

    successfully applied to action recognition in two settings:video data and Kinect coordinate data.

    Keywords Hyper-graph matching Dynamic program-ming Action recognition

    1 Introduction

    Many computer vision problems can be formulated as graphsand their associated algorithms, since graphs provide a struc-tured, flexible and powerful representation for visual data.Graph representation has been successfully used in prob-lems such as tracking [51,60], object recognition [1,4,52],object categorization [15], and shape matching [30,46]. Inthis paper, we propose a novel method for point-set match-ing via hyper-graphs embedded in spacetime coordinates,where the point sets correspond to interest points of themodeland scene videos, respectively. We focus on applications ofobject or event recognition [29]; in particular, in localizingand recognizing human actions in video sequences.

    The action recognition problem is solved via graphmatch-ing, which is conceived as a search problem for the bestcorrespondence between two point-sets, i.e., the one belong-ing to the model structured into a grapha model graphand the other one belonging to the scenea scene point set.In this context, an action is represented by a model graphconsisting of a set of nodes with associated geometric andappearance features, and of a set of edges that represent rela-tionships between nodes. Action recognition itself becomesthen a graph matching problem, which can be formalizedas a generic energy-minimization problem [52]. The graphmatching energy is a scalar quantity which is lower for likelyassignments respecting certain invariances and geometric

    123

  • J Math Imaging Vis

    transformations from model to scene, whereas it results inhigher values for unrealistic and unlikely assignments.

    The energy function, more explicitly, is composed oftwo terms, i.e., an appearance term and a geometric term.The appearance term takes into consideration the differencebetween intrinsic properties of points between the modeland the scene. Popular examples of appearance informationare SIFT [35], histograms of gradient and optical flow [25],shape contexts [3] etc. The geometric term, on the other hand,takes into consideration the structural deformation betweenmodel and scene point assignments. The geometric term isconceived to provide certain invariance properties, such asthe rate of action realization. In this work, we utilize hyper-graphs, which are a generalization of graphs and allow edges(hyper-edges, strictly speaking) to connect any number ofvertices, typically more than two. Similar to other formula-tions in object recognition, our hyper-edges connect tripletsof nodes, which allow verifying geometric invariances thatare invariant to scale-changes, as opposed to pairwise termsthat are restricted to distances [14,48,59]. Unlike [14,48,59],our geometric terms are split into spatial terms and tem-poral terms, which is better suited to the spatio-temporaldata.

    Our graphs are built on spacetime interest points, as in[48], similar to spatial formulations for object recognition asin [11,14,59] and others. This allows us stay independent ofpreceding segmentation steps, as in [10], or on parts models,where an object or an action is decomposed into a set of parts[5,17]. Parts based formulations like [5] are able to achievegood results with graphs which are smaller than our graphs,but they depend on very strong unary terms constructed frommachine learning.

    Inference is the most challenging task in graph matching,as useful formulations are known to be NP-hard problems[52]. While the graph isomorphism problem is conjecturedto be solvable in polynomial time, it is known that subgraphisomorphismexact subgraph matchingis NP-complete[19]. Practical solutions for graph matching therefore mustrely on approximations or heuristics.

    We propose an efficient hyper-graph matching method forspatio-temporal data based on three realistic assumptions:(i) causality of human movements; (ii) sequential nature ofhuman movements and limited warping; and (iii) one-to-onemapping of model-to-scene time instants. We present a theo-retical result stating that, under these assumptions, the corre-spondence problem can be decomposed into a set of subprob-lems such that each subproblem can be solved recursively interms of the others, and hence derive an exact minimizationalgorithm. A constructive proof shows that the minimizationproblem can be solved efficiently using dynamic program-ming.

    The approximated model enables us to achieve perfor-mance faster than real-time. This work thus falls into the

    category of methods calculating the exact solution for anapproximated model, unlike methods calculating an approx-imate solution, for instance [5,14,30,48,52,58,59] or meth-ods which perform exact structure preserving matching like[29]. In this sense, it can be compared to [11], where theoriginal graph of a 2D object is replaced by a k-tree allowingexact minimization by the junction tree algorithm. Our solu-tion is different in that the graphical structure is not createdrandomly but is derived from the temporal dimensions ofvideo data. We propose three different approximated graphi-cal structures, all characterized by a reduction in the numberof interest points, which allows faster than real-time perfor-mance on mid-level GPUs.

    1.1 Related Work: Graph Matching

    There is a plethora of literature on optimization and practi-cal solutions of graph matching algorithms. This extensiveresearch on practical solutions to graphmatching can be ana-lyzed under different perspectives. One common classifica-tion is:

    1. Exact matching: A strictly structure-preserving corre-spondence between the two graphs (e.g., graph isomor-phism) or at least between parts (e.g., subgraph isomor-phism) is searched.

    2. Inexact matching: Compromises in the correspondenceare allowed in principle by admitting structural deforma-tions up to some extent. Matching proceeds by minimiz-ing an objective (energy) function.

    In the computer vision context, most recent papers ongraph matching are based on inexact matching of valuedgraphs, i.e., graphs with additional geometric and/or appear-ance information associated with nodes and/or edges. Practi-cal formulations of this problem are known to be NP-hard[52], which requires approximations of the model or thematching procedure. Two different strategies are frequentlyused:

    1. Calculation of an approximate solution of the minimiza-tion problem;

    2. Calculation of the exact solution, most frequently of anapproximated model.

    1.2 Approximate Solution

    A well known family of methods solves a continuousrelaxation of the original combinatorial problem. Zass andShashua [59] presented a soft hyper-graph matching methodbetween sets of features that proceeds through an iterativesuccessive projection algorithm in a probabilistic setting.

    123

  • J Math Imaging Vis

    They extended the Sinkhorn algorithm [23], which is usedfor soft assignment in combinatorial problems, to obtain aglobal optimum in the special case when the two graphshave the same number of vertices and an exact matching isdesired. They also presented a sampling scheme to handle thecombinatorial explosion due to the degree of hyper-graphs.Zaslavskiy et al. [58] employed a convexconcave program-ming approach to solve the least-squares problem over thepermutation matrices. More explicitly, they proposed tworelaxations to the quadratic assignment problem over the setof permutation matrices which results in one quadratic con-vex and one quadratic concave optimization problem. Theyobtained an approximate solution of the matching problemthrough a path following algorithm that tracks a path of localminimum by linearly interpolating convex and concave for-mulations.

    A