
Augmenting Aerial Earth Maps with Dynamic Information
Kihwan Kim, Sangmin Oh, Jeonggyu Lee and Irfan Essa ∗

School of Interactive Computing, GVU Center, Georgia Institute of Technology
http://www.cc.gatech.edu/cpl/projects/augearth/ †

ABSTRACT

We introduce methods for augmenting aerial visualizations of Earth (from services like Google Earth or Microsoft Virtual Earth) with dynamic information obtained from videos. Our goal is to make Augmented Aerial Earth Maps that visualize an alive and dynamic scene within a city. We propose different approaches for analyzing videos of cities with pedestrians and cars, under differing conditions, and then create augmented Aerial Earth Maps (AEMs) with live and dynamic information. We further extend our visualizations to include analysis of natural phenomena (specifically clouds) and add this information to the AEMs, adding to the visual realism.

1 INTRODUCTION

Earth can be visualized on the Internet with the growth of online Aerial Earth Maps (AEMs) services (e.g., Google Earth, Microsoft Virtual Earth, etc.). We can visually browse through cities across the globe from our desktops or mobile devices and see 3D models of buildings, street views, and topologies. Information such as traffic, restaurant locations, tourist sites, and other services is provided within a geo-spatial database. However, such visualizations, while rich in information, are static and do not showcase the dynamism of what is happening in the real world. We are motivated to add dynamic information to such online visualizations of the globe.

In this paper, we introduce an approach to generate ALIVE cities, so that one can browse and see a city through dynamic and alive Aerial Earth Maps. Our approach relies on analysis of videos from different sources around the city. Fig. 1 shows a static still of such an augmented visualization, driven by analysis and subsequent registration of 36 video sources.

To achieve this goal, we have to address several technical challenges. First, we develop a framework to extract, from video, information about the geometry of the scene, the status of the scene, and the movements in the environment. Second, we register the view from the given video to a view in the AEMs. In cases where we have multiple instances of views, but still not full coverage, we need to infer what is happening in between the views, in a domain-specific manner. This requires designing models of dynamics from observed data. Third, we generate visualizations from the observed data onto the AEMs. This includes synthesizing behaviors based on videos and procedural information captured from them, and updating views as they are manipulated in the AEMs.

2 SCENARIOS UNDER STUDY FOR AEMS

Extracting dynamic information from video feeds for augmenting AEMs is a challenging problem, primarily due to the wide variety of conditions and configurations that we have to deal with. For example, we want to see people moving in different situations, and also show traffic motions. We also want to show moving clouds.

∗ e-mail: {kihwan23,sangmin,glaze,irfan}@cc.gatech.edu
† Videos and other details available from our website.

Figure 1: An overview of the Augmented Earth Map generated by our system. We make use of 36 videos to add dynamic information to city visualization. Information from each input video of traffic, people, and clouds is extracted, then it is mapped onto the Earth map in real time. See video on our website.

In each of the above instances, we have to deal with different viewpoints of videos and, in many cases, with incomplete information. We also have to track moving objects and determine coherence between different viewpoints. To address these issues of variation across the different domains of interest to us, we consider four scenarios that address the distribution of cameras and the motion of objects. We briefly describe these cases here. The results showcase the variations in configurations in how we generate a variety of augmented visualizations of live cities.
#1. Direct Mapping: Video is analyzed directly and tracked. Data is projected onto the limited regions covered by the camera's field of view. We showcase several examples of people walking around.
#2. Overlapping Cameras with Complex Motion: Several cameras with overlapping views observe a relatively small region concurrently. The motion within the area is complex. We demonstrate this case with people playing sports.
#3. Sparse Cameras with Simple Motion: Sparsely distributed cameras cover a wide area, but the dynamic information is simple. For example, traffic cameras separately observe a highway, and the motion of vehicles is relatively simple and predictable between nearby regions.
#4. Sparse Cameras and Complex Motion: Cameras are sparsely distributed, each observing a different part of a larger area, and the motion of the target scene is complex. This is the case where we observe and model natural phenomena such as clouds in the sky.

3 RELATED AND MOTIVATING PAST WORK

Our work builds on existing efforts in computer vision on tracking objects [16, 13] and multi-view registration [5]. We also rely on behavioral animation approaches [8, 9] from computer graphics.

We build on the work of Seitz and Dyer [12], which uses morphing with camera models to generate in-between views (view synthesis). They reconstruct intermediate views using precise stereo pairs and local correspondences. In our approach, where a large number of videos are available and need to be registered in real time, we use global blending approaches to register a number of views and visualize them immediately in the AEMs.

Harris [4] introduces a method for rendering realistic clouds using impostors and sprites generated procedurally for real-time games. While extremely realistic, these do not suit our purposes, as we are interested in driving clouds from video data.


Figure 2: Example of direct mapping: (a) a single video observing a pedestrian and a car; (b) screen-space positions mapped onto the virtual plane space.

Turk and O'Brien [14] use RBF interpolation to find implicit surfaces from scattered data using constraint points. We also rely on Perlin noise [7] for generating cloud volumes or sprites for generating cloud maps.

The work closest to ours is the Video Flashlights system [10]. This is a surveillance application, which tracks people in a fixed region using multiple cameras and maps them onto the observed region. Several other similarly motivated surveillance applications have also been introduced [11].

4 VIDEO TO AUGMENTED AERIAL MAPS

Now we provide technical details of our approaches and also present how the configurations of cameras and dynamic information described in Section 2 are analyzed and then visualized. We first start with the simplest scenario, then introduce more complex situations with different approaches. More details about each approach are available from our project website.

4.1 Direct Mapping from a Single Video: Pedestrians
In our first scenario, we have video from a single viewpoint and we are interested in projecting the video and the related motions onto an aerial view from an AEM. This scenario requires direct mapping of tracked objects in a scene frame onto the virtual plane. This is the simplest of our scenarios and, in essence, a basic building block for all of the other scenario cases described in Section 2.

We rely on direct mapping from video to geometry to visualize pedestrians in videos. As shown in Fig. 2, we first track [1, 13] the pedestrians and extract screen-space coordinates and velocities. These measurements are directly registered onto the plane space of the virtual environment. If the homogeneous coordinates in the video frame are p_{x,y} = [x, y, 1]^T, the new 2D location p̂ on the planar space of the virtual environment is simply calculated by p̂ = H p_{x,y}, where H is a planar homography. Subsequently, if an object (pedestrian) is moving in the video with velocity v, we can also project the velocity onto the Earth-map plane by v̂ = H v_{x,y}. This velocity is used to match simple human motion-capture data gathered off-line (from [15]) to visualize moving pedestrians. We first sample the trajectories of the objects, then fit exactly one cycle of walking data onto them and interpolate the positions.
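To make this mapping concrete, here is a minimal sketch in Python/NumPy (the authors' prototype is C++/OpenGL; the function names and the finite-difference velocity approximation are ours), assuming the planar homography H has already been estimated for the camera:

```python
import numpy as np

def project_to_map(H, points_xy):
    """Project screen-space points onto the AEM ground plane via a
    planar homography H (3x3). points_xy is an (N, 2) array."""
    pts_h = np.hstack([points_xy, np.ones((len(points_xy), 1))])  # homogeneous [x, y, 1]
    mapped = (H @ pts_h.T).T
    return mapped[:, :2] / mapped[:, 2:3]   # de-homogenize

def project_velocity(H, point_xy, vel_xy, eps=1.0):
    """Approximate the mapped velocity (the paper's v_hat = H v) by projecting
    the point and the point displaced by the velocity, then differencing."""
    p0 = project_to_map(H, np.array([point_xy]))[0]
    p1 = project_to_map(H, np.array([point_xy + eps * np.asarray(vel_xy)]))[0]
    return (p1 - p0) / eps
```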

In our current implementation, we do not classify objects or recognize their states in the scene. We assume all moving objects on a sidewalk are pedestrians walking or performing simple linear actions. On the road, we assume the object of interest is a moving car.

Direct mapping, as described here, is used when (1) a region of interest is covered by a single viewpoint, and (2) the motions of the objects are simple. In Sections 4.2-4.4, we introduce methods to handle more complex scenarios.

4.2 Overlapping Cameras, Complex Motions: Sports
We now move to the domain where we have videos with overlapping views and motions that have some structure, with several motions occurring at the same time. While we have employed this case for a variety of scenarios, we demonstrate it here in the domain of sports.

Sports videos usually have multiple views, and in most instances we can rely on the field markings to help with registration. The overlapping views in this domain also require additional types of modeling and synthesis beyond the direct mapping from a single view (Section 4.1).

Figure 3: The range approximately invariant to distortion: (a) and (b) are both back-projected scenes from the same video.

We start by obtaining the fields of view (FOVs) f_i and camera homographies H_i (from each view to a corresponding patch in the AEM) from the videos, as described earlier. Then, the videos are rectified to top views based on the extracted homographies and registered onto the corresponding regions of the AEM.
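With OpenCV, for example, such a rectification is one call per frame (a sketch with our own names; H_i maps the video frame to its patch on the AEM plane, and patch_size is the patch's pixel extent):

```python
import cv2

def rectify_to_topview(frame, H_i, patch_size):
    """Warp a video frame to its top-view patch on the AEM plane using the
    view-to-map homography H_i. patch_size = (width, height) of the patch texture."""
    return cv2.warpPerspective(frame, H_i, patch_size)
```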

For each video, the rectified top view is used as a texture on the AEM plane. Then, this textured video is re-projected to a virtual view based on the model-view matrix in the AEM environment. We refer to this view as the back-projected view, and to the angle between the original view and its back-projected view as θ.

Once multiple views covering the same region are registered onto the AEM plane, their rectified views also overlap. Our goal is to generate virtual views based on the two consecutive views that exhibit the most similar viewing angle θ.

First, we search for the pair of closest views exhibiting small θ's. Let these consecutive views and the corresponding angles be denoted by f_i, f_{i+1} and θ_i, θ_{i+1} respectively, and denote the rectified planar textures by f̂_i and f̂_{i+1}.

Then, we compute the two weights for both views based on the angle differences, where the smaller angle leads to a larger weight: ω_i = θ_{i+1} / (θ_i + θ_{i+1}) and ω_{i+1} = θ_i / (θ_i + θ_{i+1}). If ω_i is close to one, which indicates that the virtual view is close to the given view f_i, then it is safe for a fragment shader to render the back-projected view based on f̂_i only. This is the case where the virtual view is similar to f̂_i, so that the virtually back-projected scene appears approximately invariant to distortions (Fig. 3(a)(b)).

In general, we blend not only the pair of rectified scenes (f̂_i, f̂_{i+1}) but also the subtracted background scenes generated from the pair of view videos. Suppose that f(t) and g(t) are bicubic blending functions for the two views. Then, we blend each pixel in the virtual view as p_v = f(ω_i) p_i + g(ω_i) p_{i+1} + ω_bkg p_bkg, where ω_bkg = 1 − (f(ω_i) + g(ω_i)), p_bkg = ω_i p_i^bkg + ω_{i+1} p_{i+1}^bkg, and p_i^bkg and p_{i+1}^bkg are the pixels of the subtracted background computed separately from f̂_i and f̂_{i+1}, respectively.

Now, the virtual view is almost identical to the background if the target view is out of range of both views. If the target view approaches a view to some extent, the synthesized view smoothly transitions to the back-projected view of the closest viewpoint. In Section 4.3, we introduce an approach to visualize moving objects when multiple videos do not overlap and the cameras are sparsely distributed.
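A minimal sketch of this angle-based blending (Python; the names and the shape of the falloff function are ours, not the authors' shader code) assumes the two rectified textures, their background-subtracted counterparts, and the angles θ_i, θ_{i+1} are already available:

```python
import numpy as np

def falloff(w, lo=0.6, hi=0.9):
    """Stand-in for the paper's bicubic blending functions f and g: zero until the
    weight passes `lo`, ramping smoothly to one at `hi` (our choice of shape)."""
    t = np.clip((w - lo) / (hi - lo), 0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)

def blend_virtual_view(tex_i, tex_j, bkg_i, bkg_j, theta_i, theta_j):
    """Blend two rectified views and their subtracted backgrounds.
    All images are float arrays of identical shape; a smaller angle gives a larger weight."""
    w_i = theta_j / (theta_i + theta_j)   # omega_i
    w_j = theta_i / (theta_i + theta_j)   # omega_{i+1}
    f_i = falloff(w_i)                    # f(omega_i)
    g_i = falloff(w_j)                    # g(omega_i), taken here as f(1 - omega_i)
    w_bkg = 1.0 - (f_i + g_i)             # background fills in when neither view dominates
    p_bkg = w_i * bkg_i + w_j * bkg_j
    return f_i * tex_i + g_i * tex_j + w_bkg * p_bkg
```

With this falloff, the output reduces to the blended background when neither view is close to the virtual viewpoint, and transitions smoothly to the nearest back-projected view as its weight approaches one, matching the behavior described above.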

4.3 Sparse Cameras with Simple Motion: Traffic
We demonstrate this scenario with the case of analyzing videos of traffic and synthesizing traffic movements dynamically on AEMs. The biggest technical challenge in this scenario is that, with sparsely distributed, non-overlapping cameras, we do not have the advantage of knowing how the geometry or the movement seen by each camera is related to the others.


Figure 4: Red nodes indicate the observable regions M(X) and green nodes the unobserved regions M̃(X). (a) The middle chain corresponds to the traffic conditions on the graph that represents the traffic system. (b) Split: outgoing regions O(X_i) from node X_i are marked. (c) Merging: incoming regions I(X_i) to node X_i are marked.

To deal with this problem, we need to model the movements in each view and connect the observations between cameras, i.e., to model flows from one view to another. In our framework, we add two functionalities: (1) modeling the flow using a graph-based representation to infer a plausible traffic flow across unobservable regions, and (2) developing a synthesis framework that can be driven by such data.

Fig. 4(a) shows the topology of traffic nodes. Each node is a patch of the road and has a traffic state X_i and a synthesized state Z_i. A measurement Y_i is defined only in observable nodes monitored by cameras. From visual tracking [1, 13], we get estimates of the positions and velocities of objects in the video. The measurement Y_i consists of the positions and velocities of the low-level features f_i and the detected cars r_i.

The state X_i is designed to capture the essential information necessary to visualize traffic flow: (1) the average traffic flow n_i, and (2) the average velocity v_i of cars, i.e., X_i = (n_i, v_i). By average flow, we mean the average number of cars passing through a region in unit time, whereas v_i denotes the average velocity of the cars.

Then, the entire traffic system is defined as X = {X_i | 1 ≤ i ≤ k_X}, where k_X is the number of nodes. Additionally, the physical length of every i-th node is obtained from the available geo-spatial database and is denoted by d_i. Once the traffic system is decomposed into a graph manually, a set of observable regions M(X) ⊂ X with available video measurements is identified. On the other hand, the unobservable regions are denoted by M̃(X), where X = M(X) ∪ M̃(X).

The obtained measurement information Y_i is used to estimate an observable state X_i ∈ M(X), after projection onto the virtual map plane using the available homography H_i for that region. First, the average speed v̂_i of the cars is estimated as the average of the projected speeds of f_i w.r.t. the homography H_i. Second, the flow of cars passing through the i-th region, n̂_i, can be computed using the fact that the number of cars in the region, N_{r_i}, is the product of the flow and the average time d_i / v̂_i for cars to pass through the region, i.e., N_{r_i} = n̂_i · (d_i / v̂_i).
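The per-node estimate is therefore a one-line computation; a small sketch (illustrative names, ours):

```python
def estimate_observed_state(num_cars_in_region, avg_projected_speed, region_length):
    """Estimate X_i = (n_i, v_i) for an observed node: with N_ri cars in a region of
    length d_i moving at average speed v_i, the flow follows from N_ri = n_i * (d_i / v_i)."""
    flow = num_cars_in_region * avg_projected_speed / region_length
    return flow, avg_projected_speed
```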

Once the set of states M(X̂) is estimated for the observable regions, they are used to estimate the unobserved states M̃(X). We adopt the Bayesian network formalism [6] to exploit the fact that the unknown traffic conditions M̃(X) can be estimated by propagating the observed information through the spatial correlation models. The whole traffic graph is a directed graph in which an edge from a region X_j to another region X_i exists whenever traffic can move from X_j to X_i. For every node X_i, a local spatial model P(X_i | I(X_i)) between the node X_i and its incoming nodes I(X_i) is specified (see Fig. 4). Once all the local spatial models are defined, the posterior traffic conditions P(X_i | M(X̂)), ∀ X_i ∈ M̃(X), at the unobserved nodes are inferred using belief propagation [6, 3].

We make the assumption that the average flow X_i|n of the cars in a region X_i matches the sum of the average flows of the cars from the incoming regions I(X_i), up to a slight variation w_i that follows white Gaussian noise: X_i|n = Σ_{X_j ∈ I(X_i)} X_j|n + w_i. For velocity, we assume that the average speed in a region matches the average speed of the cars in the incoming regions, again with a slight white-Gaussian variation q_i: X_i|v = (Σ_{X_j ∈ I(X_i)} X_j|v) / N_{I(X_i)} + q_i. The variances of the Gaussian noises w_i and q_i are set to be proportional to the length d_i of the target region i.
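Under these assumptions, the means of the unobserved states can be filled in by a simple forward pass over the road graph. The sketch below is ours (not the authors' belief-propagation implementation): it propagates only the means, drops the variances, and assumes a cycle-free graph whose unknowns are reachable from observed nodes.

```python
def propagate_means(incoming, observed):
    """Forward-propagate mean flow and speed through the traffic graph.
    incoming: dict node -> list of incoming nodes; observed: dict node -> (flow, speed)
    for camera-covered nodes. Returns estimates for every node that can be reached;
    nodes whose predecessors are never resolved are left unestimated."""
    est = dict(observed)
    changed = True
    while changed:
        changed = False
        for node, preds in incoming.items():
            if node in est or not preds or not all(p in est for p in preds):
                continue
            flow = sum(est[p][0] for p in preds)                  # X_i|n = sum of incoming flows
            speed = sum(est[p][1] for p in preds) / len(preds)    # X_i|v = mean of incoming speeds
            est[node] = (flow, speed)
            changed = True
    return est
```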

Finally, to visualize traffic flow based on the estimated traffic states X̂, we developed a parameterized version of Reynolds' behavior-simulation approach [8]. By parameterized behavior simulation, we mean that the cars are controlled by the associated behavior-based controller, but the behaviors are parameterized by the current traffic conditions. The controller of a car in the i-th region is parameterized by X_i. Hence, the behavior of a car Z_i varies if the estimated traffic condition within a region changes or if the car moves into an adjacent traffic region, based on the given X_i.
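To make "parameterized behavior" concrete, a toy per-car update could look like the following; the steering constants and headway rule are our own choices for illustration, not the authors' controller, which follows Reynolds [8].

```python
def steer_car(pos, vel, leader_gap, region_state, dt=0.05):
    """Toy controller for one car, parameterized by its region's estimated state
    X_i = (flow, target_speed): relax toward the region's average speed, and brake
    (Reynolds-style separation) when following the car ahead too closely."""
    flow, target_speed = region_state
    accel = 0.5 * (target_speed - vel)      # relax toward the region's average speed
    safe_gap = max(2.0, vel * 1.5)          # simple headway rule (our choice)
    if leader_gap < safe_gap:
        accel = min(accel, -2.0)            # separation: brake if too close
    vel = max(0.0, vel + accel * dt)
    return pos + vel * dt, vel
```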

4.4 Sparse Cameras with Complex Motion: Clouds
Our final scenario aims to use videos of natural phenomena, primarily clouds, and to add them to AEMs for an additional sense of reality. Cameras are spatially distributed, and only a small subset of the targeted sky area to be synthesized is visible within the FOVs of the cameras. To measure the dynamic movement of clouds, we also extract velocities from the videos. We assume the videos always look at 90 degrees of elevation and zero degrees of azimuth; we refer to such a video as an anchor video.

We use Radial Basis Functions (RBFs) [2] to globally interpolate the density of clouds in unobserved sky regions based on multiple videos. The main concept of our interpolation follows a method described in [14], where an implicit surface is interpolated from a given set of scattered data points. We use this method to interpolate the density of the unobservable regions of the sky. In our work, the constraint points are the locations of feature points (x_i, y_i) extracted from each input video, and the basis vector is defined as G = [G_1(x_1, y_1), . . . , G_n(x_n, y_n)]^T, encoded by strong gradient and velocity vectors representing the density of clouds. A basis function d_ij between any two constraint points is chosen as (x_i − x_j)² + (y_i − y_j)². Using these measurements, we can globally interpolate the cloud density at any point in the unobserved sky region by a weighted sum of basis functions.
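A compact sketch of the RBF fit and evaluation (ours; a Gaussian kernel is used here for a well-conditioned linear system, whereas the paper's basis is the squared distance between constraint points and [14] uses variational implicit functions):

```python
import numpy as np

def fit_rbf(points, values, sigma=0.15, ridge=1e-8):
    """Fit RBF weights w so that sum_j w_j * phi(|p - p_j|) matches the cloud-density
    values at the constraint points (feature locations extracted from the videos)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    A = np.exp(-d2 / (2 * sigma**2)) + ridge * np.eye(len(points))  # Gaussian basis + stabilizer
    return np.linalg.solve(A, values)

def eval_density(points, weights, query, sigma=0.15):
    """Interpolated cloud density at arbitrary sky locations: a weighted sum of bases."""
    d2 = ((query[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2)) @ weights
```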

Now, the generated density map is used as a density function for procedural modeling of cloud textures [7]. Figure 5(a)(b) shows an example of a density map generated from four videos. However, if the generated density map is not appropriate due to mis-detection of feature vectors, the system provides a user interface to edit the density map by adding additional basis vectors.

Once the cloud textures are generated, the resulting sky textures are mapped onto sky domes [17]. To visualize a sky that represents the dynamics of the current sky, the mapped sky textures move based on the velocity captured from the anchor video.

5 DISCUSSION OF EXPERIMENTS AND RESULTS

To validate our approach, we undertook a series of experiments for the different scenarios on a variety of challenging domains, under varying conditions. Fig. 6 shows static stills of each scenario. Please see our website for videos and other results.

The prototype system is developed using C++/OpenGL on a computer with a quad-core 2.5 GHz CPU, 2 GB RAM, and an NVIDIA Quadro FX770M graphics card.


Figure 5: Generating cloud layers procedurally using videos: (a) videos, (b) interpolated map from RBF, (c) resulting cloud layer.



Figure 6: Results from our prototype system using 36 videos. (1) View blending: (a) 5 cameras for a soccer game, (b) two broadcast footages of an NCAA football game, (c) three surveillance cameras. (2) Traffic: (d) merging lanes, (e) rendered traffic scene and corresponding simulated scene, (f) 8 cameras for a larger-scale traffic simulation including merges and splits. (3) Pedestrians: (g) direct mapping of pedestrians with simple motion. (4) Clouds: (h) four videos for cloud and sky generation.

The resulting visualizations are rendered in real time at approximately 20 frames per second, with up to 500 targets tracked for the traffic-flow scenario.

For the scenarios that require view blending, the resulting view transitions are smooth and provide dynamic visualizations (Fig. 6(a,b,c)). In the traffic scenario, the visualized traffic closely matches the average speeds of the real traffic system (Fig. 6(d,e)). However, we noted that the quality of the estimated traffic flow deteriorates in proportion to the distance between the cameras. The cameras used in our results are placed no more than 0.5 miles apart and provide qualitatively plausible visualizations.

In the challenging large-scale experiment shown in Fig. 6(f), 8 cameras are installed at different locations, and the visualized road system consists of 13 observed and 26 unobserved regions. Some cameras observe one-way roads while others observe two-way roads, and the traffic topology includes merges, exits, and bridge crossings.

Various styles of sky and clouds are generated, as shown in Fig. 6(h). Although the visualization results do not capture the exact shape of the individual clouds or the exact sky atmosphere, the movement and density of the distributed clouds plausibly reflect the characteristics of the input videos.

6 SUMMARY, LIMITATIONS, AND FUTURE WORK

In this paper, we introduced methods for creating Augmented Aerial Earth Maps (AAEMs) with diverse types of real-time live information, namely pedestrians, sports scenes, traffic flows, and sky. The proposed set of solutions targets different types of scenes in terms of camera-network configuration and density and the dynamism presented by the scenes. The prototype system, which integrates all the components, runs in real time and demonstrates that our work provides a novel, vivid, and more engaging virtual environment through which users can browse the cities of now.

It is also important to note a few of our limitations. First, we cannot directly apply our traffic-flow approaches to scenes with higher-level controls and behaviors, e.g., intersections with traffic lights, which is an interesting avenue for future research. Second, our solutions do not support the automatic retrieval of high-level semantic information, e.g., car accidents or street burglaries.

In our future work, we aim to overcome the above limitations and incorporate even more types of dynamic information, such as flowing rivers, trees in the wind, the sun, weather patterns, environmental conditions, and even aerial objects like birds and airplanes.

Acknowledgements: We would like to thank the Georgia Tech Athletic Association for providing us with the videos of the football game. Thanks also to the reviewers for their insightful comments.

REFERENCES

[1] J.-Y. Bouguet. Pyramidal implementation of the Lucas-Kanade feature tracker. Intel Corporation, 2003.
[2] M. D. Buhmann and M. J. Ablowitz. Radial Basis Functions: Theory and Implementations. Cambridge University Press, 2003.
[3] B. Frey and D. MacKay. A revolution: Belief propagation in graphs with cycles. In NIPS '98, 1998.
[4] M. J. Harris. Real-time cloud simulation and rendering. In SIGGRAPH '05 Courses, page 222. ACM, 2005.
[5] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge Univ. Press, 2000.
[6] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[7] K. Perlin. An image synthesizer. SIGGRAPH Comput. Graph., 19(3):287-296, 1985.
[8] C. W. Reynolds. Flocks, herds and schools: A distributed behavioral model. In SIGGRAPH '87, pages 25-34. ACM Press, 1987.
[9] C. W. Reynolds. Steering behaviors for autonomous characters. In GDC '99, pages 768-782, 1999.
[10] H. S. Sawhney, A. Arpa, R. Kumar, S. Samarasekera, M. Aggarwal, S. Hsu, D. Nister, and K. Hanna. Video flashlights. In 13th Eurographics Workshop on Rendering, pages 157-168, 2002.
[11] I. O. Sebe, J. Hu, S. You, and U. Neumann. 3D video surveillance with augmented virtual environments. In IWVS '03, pages 107-112, 2003.
[12] S. M. Seitz and C. R. Dyer. View morphing. In SIGGRAPH '96, pages 21-30, 1996.
[13] J. Shi and C. Tomasi. Good features to track. In Proceedings of IEEE CVPR, pages 593-600. IEEE Computer Society, 1994.
[14] G. Turk and J. F. O'Brien. Shape transformation using variational implicit functions. In SIGGRAPH '99, pages 335-342, 1999.
[15] CMU motion capture database. http://mocap.cs.cmu.edu/.
[16] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Comput. Surv., 38(4):13, 2006.
[17] G. Zotti and M. E. Groller. A sky dome visualisation for identification of astronomical orientations. In INFOVIS '05, pages 2-10, 2005.