J Math Imaging VisDOI 10.1007/s10851-014-0503-6
Fast Exact Hyper-graph Matching with Dynamic Programmingfor Spatio-temporal Data
Oya eliktutan Christian Wolf Blent Sankur Eric Lombardi
Received: 13 March 2013 / Accepted: 13 February 2014 Springer Science+Business Media New York 2014
Abstract Graphs and hyper-graphs are frequently used torecognize complex and often non-rigid patterns in computervision, either through graph matching or point-set matchingwith graphs. Most formulations resort to the minimization ofa difficult energy function containing geometric or structuralterms, frequently coupled with data attached terms involvingappearance information. Traditional methods solve the min-imization problem approximately, for instance resorting tospectral techniques. In this paper, we deal with the spatio-temporal data, for a concrete example, human actions invideo sequences. In this context, we first make three realisticassumptions: (i) causality of human movements; (ii) sequen-tial nature of human movements; and (iii) one-to-one map-ping of time instants.We show that, under these assumptions,the correspondence problem can be decomposed into a set ofsubproblems such that each subproblem can be solved recur-sively in terms of the others, and hence an efficient exactmin-imization algorithm can be derived using dynamic program-ming approach. Secondly, we propose a special graphicalstructure which is elongated in time. We argue that, insteadof approximately solving the original problem, a solution canbe obtained by exactly solving an approximated problem. Anexact minimization algorithm is derived for this structure and
O. eliktutan (B) B. SankurDepartment of Electrical-Electronics Engineering, Bogazii Uni-versity, Istanbul, Turkeye-mail: email@example.com
B. Sankure-mail: firstname.lastname@example.org
C. Wolf E. LombardiUMR CNRS 5205, INSA-Lyon, LIRIS, Universit de Lyon, CNRS,69621 Lyon, Francee-mail: email@example.com
E. Lombardie-mail: firstname.lastname@example.org
successfully applied to action recognition in two settings:video data and Kinect coordinate data.
Keywords Hyper-graph matching Dynamic program-ming Action recognition
Many computer vision problems can be formulated as graphsand their associated algorithms, since graphs provide a struc-tured, flexible and powerful representation for visual data.Graph representation has been successfully used in prob-lems such as tracking [51,60], object recognition [1,4,52],object categorization , and shape matching [30,46]. Inthis paper, we propose a novel method for point-set match-ing via hyper-graphs embedded in spacetime coordinates,where the point sets correspond to interest points of themodeland scene videos, respectively. We focus on applications ofobject or event recognition ; in particular, in localizingand recognizing human actions in video sequences.
The action recognition problem is solved via graphmatch-ing, which is conceived as a search problem for the bestcorrespondence between two point-sets, i.e., the one belong-ing to the model structured into a grapha model graphand the other one belonging to the scenea scene point set.In this context, an action is represented by a model graphconsisting of a set of nodes with associated geometric andappearance features, and of a set of edges that represent rela-tionships between nodes. Action recognition itself becomesthen a graph matching problem, which can be formalizedas a generic energy-minimization problem . The graphmatching energy is a scalar quantity which is lower for likelyassignments respecting certain invariances and geometric
J Math Imaging Vis
transformations from model to scene, whereas it results inhigher values for unrealistic and unlikely assignments.
The energy function, more explicitly, is composed oftwo terms, i.e., an appearance term and a geometric term.The appearance term takes into consideration the differencebetween intrinsic properties of points between the modeland the scene. Popular examples of appearance informationare SIFT , histograms of gradient and optical flow ,shape contexts  etc. The geometric term, on the other hand,takes into consideration the structural deformation betweenmodel and scene point assignments. The geometric term isconceived to provide certain invariance properties, such asthe rate of action realization. In this work, we utilize hyper-graphs, which are a generalization of graphs and allow edges(hyper-edges, strictly speaking) to connect any number ofvertices, typically more than two. Similar to other formula-tions in object recognition, our hyper-edges connect tripletsof nodes, which allow verifying geometric invariances thatare invariant to scale-changes, as opposed to pairwise termsthat are restricted to distances [14,48,59]. Unlike [14,48,59],our geometric terms are split into spatial terms and tem-poral terms, which is better suited to the spatio-temporaldata.
Our graphs are built on spacetime interest points, as in, similar to spatial formulations for object recognition asin [11,14,59] and others. This allows us stay independent ofpreceding segmentation steps, as in , or on parts models,where an object or an action is decomposed into a set of parts[5,17]. Parts based formulations like  are able to achievegood results with graphs which are smaller than our graphs,but they depend on very strong unary terms constructed frommachine learning.
Inference is the most challenging task in graph matching,as useful formulations are known to be NP-hard problems. While the graph isomorphism problem is conjecturedto be solvable in polynomial time, it is known that subgraphisomorphismexact subgraph matchingis NP-complete. Practical solutions for graph matching therefore mustrely on approximations or heuristics.
We propose an efficient hyper-graph matching method forspatio-temporal data based on three realistic assumptions:(i) causality of human movements; (ii) sequential nature ofhuman movements and limited warping; and (iii) one-to-onemapping of model-to-scene time instants. We present a theo-retical result stating that, under these assumptions, the corre-spondence problem can be decomposed into a set of subprob-lems such that each subproblem can be solved recursively interms of the others, and hence derive an exact minimizationalgorithm. A constructive proof shows that the minimizationproblem can be solved efficiently using dynamic program-ming.
The approximated model enables us to achieve perfor-mance faster than real-time. This work thus falls into the
category of methods calculating the exact solution for anapproximated model, unlike methods calculating an approx-imate solution, for instance [5,14,30,48,52,58,59] or meth-ods which perform exact structure preserving matching like. In this sense, it can be compared to , where theoriginal graph of a 2D object is replaced by a k-tree allowingexact minimization by the junction tree algorithm. Our solu-tion is different in that the graphical structure is not createdrandomly but is derived from the temporal dimensions ofvideo data. We propose three different approximated graphi-cal structures, all characterized by a reduction in the numberof interest points, which allows faster than real-time perfor-mance on mid-level GPUs.
1.1 Related Work: Graph Matching
There is a plethora of literature on optimization and practi-cal solutions of graph matching algorithms. This extensiveresearch on practical solutions to graphmatching can be ana-lyzed under different perspectives. One common classifica-tion is:
1. Exact matching: A strictly structure-preserving corre-spondence between the two graphs (e.g., graph isomor-phism) or at least between parts (e.g., subgraph isomor-phism) is searched.
2. Inexact matching: Compromises in the correspondenceare allowed in principle by admitting structural deforma-tions up to some extent. Matching proceeds by minimiz-ing an objective (energy) function.
In the computer vision context, most recent papers ongraph matching are based on inexact matching of valuedgraphs, i.e., graphs with additional geometric and/or appear-ance information associated with nodes and/or edges. Practi-cal formulations of this problem are known to be NP-hard, which requires approximations of the model or thematching procedure. Two different strategies are frequentlyused:
1. Calculation of an approximate solution of the minimiza-tion problem;
2. Calculation of the exact solution, most frequently of anapproximated model.
1.2 Approximate Solution
A well known family of methods solves a continuousrelaxation of the original combinatorial problem. Zass andShashua  presented a soft hyper-graph matching methodbetween sets of features that proceeds through an iterativesuccessive projection algorithm in a probabilistic setting.
J Math Imaging Vis
They extended the Sinkhorn algorithm , which is usedfor soft assignment in combinatorial problems, to obtain aglobal optimum in the special case when the two graphshave the same number of vertices and an exact matching isdesired. They also presented a sampling scheme to handle thecombinatorial explosion due to the degree of hyper-graphs.Zaslavskiy et al.  employed a convexconcave program-ming approach to solve the least-squares problem over thepermutation matrices. More explicitly, they proposed tworelaxations to the quadratic assignment problem over the setof permutation matrices which results in one quadratic con-vex and one quadratic concave optimization problem. Theyobtained an approximate solution of the matching problemthrough a path following algorithm that tracks a path of localminimum by linearly interpolating convex and concave for-mulations.
A specific form of relaxation is done by spectral meth-ods,which study the similarities between the eigen-structuresof the adjacency or Laplacian matrices of the graphs or ofthe assignment matrices corresponding to the minimizationproblem formulated in matrix form. In particular, Duchenneet al.  generalized the spectral matching method fromthe pairwise graphs presented in  to hyper-graphs byusing a tensor-based algorithm to represent affinity betweenfeature tuples, which is then solved as an eigen-problemon the assignment matrix. More explicitly, they solved therelaxed problem by using a multi-dimensional power iter-ation method, and obtained a sparse output by taking intoaccount l1-norm constraints instead of the classical l2-norm.Leordeanu et al. made an improvement on the solution tothe integer quadratic programming problem in  by intro-ducing a semi-supervised learning approach. Lee et al. solved the relaxed problem by a random walk algorithm onthe association graph of the matching problem, where eachnode corresponds to a possible matching between the twohyper graphs. The algorithm jumps from one node (match-ing) to another one with given probabilities.
Another approach is to decompose the original discretematching problem into subproblems, which are then solvedwith different optimization tools. A case in point, Torresaniet al.  solved the subproblems through graph-cuts, Hun-garian algorithm and local search. Lin et al.  first deter-mined a number of subproblems where each one is char-acterized by local assignment candidates, i.e., by plausiblematches betweenmodel and scene local structures. For exam-ple, in action recognition domain, these local structures cancorrespond to human body parts. Then, they built a candi-dacy graph representation by taking into account these candi-dates on a layered (hierarchical) structure and formulated thematching problem as a multiple coloring problem. Finally,Duchenne et al.  extended one dimensional multi-labelgraph cuts minimization algorithm to images for optimizingthe Markov Random Fields (MRFs).
1.3 Approximate Graphical Structure
Many optimization problems in vision are NP-hard. Thecomputation of exact solutions is often unaffordable, e.g.,in real-time video processing. A common approach is toapproximate the data model, for instance the graphical struc-ture, as opposed to applying an approximate matching algo-rithm to the complete data model. One way is to simplify thegraph by filtering out the unfruitful portion of the data beforematching. For example, a method for object recognition hasbeen proposed byCaetano et al. ,which approximates themodel graph by building a k-tree randomly from the spatialinterest points of the object. Then, matching was calculatedusing the classical junction tree algorithm  known forsolving the inference problem in Bayesian Networks.
A special case is the work by Bergthold et al. , whoperform object recognition using fully connected graphs ofsmall size (between 5 and 15 nodes). The graphs can be smallbecause the nodes correspond to semantically meaningfulparts in an object, for instance landmarks in a face or bodyparts in human detection. A spanning tree is calculated onthe graph, and from this tree a graph is constructed describ-ing the complete state space. The A algorithm then searchesthe shortest path in this graph using a search heuristic. Themethod is approximative in principle, as hypotheses are dis-carded due to memory requirements. However, for some ofthe smaller graphs used in certain applications, the exact solu-tion can be calculated. Matching is typically done between1 and 5 s per graph, but the worst case is reported to takeseveral minutes.
1.4 Related Work: Action Recognition
While the methods described in Sect. 1.1 focus on objectrecognition in still images or in 3D as practical applications,we here present a brief review on the most relevant methodsthat apply graph matching to human action recognition prob-lem in video sequences. Graph-based methods frequentlyexploit sparse local features extracted from the neighbor-hood of spatio-temporal interest points (STIPs). A spatio-temporal interest point can be defined as a point exhibit-ing saliency in the space and time domains, i.e., high gradi-ent or local maxima of the spatio-temporal filter responses.The most frequently used detectors are periodic (1D Gaborfilters) , 3D Harris corner detector , extension ofScale-invariant Feature Transform (SIFT) to time domain etc. It is typical to describe the local neighborhood ofan interest point by concatenation of gradient values ,of optical flow values , or of spatial-temporal jets ,histogram of gradient and optical flow values (HoG andHoF)  or 3D Scale-invariant Feature Transform (SIFT).
J Math Imaging Vis
Ta et al.  have built hyper-graphs fromproximity infor-mation by thresholding distances between spatio-temporalinterest points . Given amodel and a scene graph, match-ing is conducted by the algorithm of Duchenne et al. in .Similarly, Gaur et al.  modeled the relationship of STIPs in a local neighborhood, i.e., they have built local fea-ture graphs from small temporal...