


Robust Vehicle Tracking for Urban Traffic Videos at Intersections

Li C. (1,2), Chiang A. (1,2), Dobler G. (1,3), Wang Y. (2), Xie K. (1,4), Ozbay K. (1,4), Ghandehari M. (1,4), Zhou J. (1), Wang D. (1)

(1) Center for Urban Science + Progress, New York University, NY, USA
(2) Department of Electrical and Computer Engineering, New York University, NY, USA
(3) Department of Physics, New York University, NY, USA
(4) Department of Civil and Urban Engineering, New York University, NY, USA

{chenge.li, atc327, greg.dobler, yw523, kun.xie, kaan.ozbay, masoud, jz2308, dw1609}@nyu.edu

Abstract

We develop a robust, unsupervised vehicle tracking system for videos of very congested road intersections in urban environments. Raw tracklets from the standard Kanade-Lucas-Tomasi tracking algorithm are treated as sample points and grouped to form different vehicle candidates. Each tracklet is described by multiple features including position, velocity, and a foreground score derived from robust PCA background subtraction. By considering each tracklet as a node in a graph, we build the adjacency matrix for the graph based on the feature similarity between the tracklets and group these tracklets using spectral embedding and Dirichlet Process Gaussian Mixture Models. The proposed system yields excellent performance for traffic videos captured in urban environments and on highways.

1. Introduction and Motivation

Traffic safety is one of the most pressing modern urban problems. In 2014 alone, there were 6,064,000 traffic accidents in the United States resulting in 29,989 civilian deaths [1]. Typically, traffic conflicts are characterized and cataloged by subjective observations in the field or by reviewing field videos manually, resulting in sparse classification of potential hazards [2]. Furthermore, the identification of high-risk locations and the assessment of safety improvement solutions are done through the use of historical crash data, implying that traffic safety engineers must wait to observe actual accidents. Given rapid advances in object tracking methodology in recent years, computer vision (CV) techniques for urban traffic surveillance offer a promising alternative for developing non-crash, surrogate safety measures [3]. A number of recent studies have used CV-based methods for traffic conflict analysis [4, 5, 6, 7, 8, 9].

In this paper, we develop a robust system to automatically extract vehicle trajectory data from video obtained by existing traffic cameras from the New York City Department of Transportation (NYCDOT). This trajectory information, which will be used in our subsequent study, can be used to develop realistic surrogate safety measures to both identify high-risk locations and assess implemented safety improvements on time scales of days to weeks rather than years.

Figure 1. A representation of the steps in our vehicle tracking system (video input → KLT tracking → rPCA foreground blob detection → subsampling & filtering → smoothing → adjacency matrix construction → DPGMM → unified labels across truncations → vehicle trajectories). Raw video is first sent through a KLT tracking algorithm, generating positions and velocities of feature point tracklets. Each tracklet is tagged with a foreground "blob" separated from the background via Robust PCA. Stationary tracklets are filtered out to avoid tracking non-moving objects, and tracklets are then smoothed for increased accuracy. An adjacency is defined for each tracklet pair based on position, velocity, and blob label, and this matrix is used to group tracklets into coherent trajectories via spectral clustering.

1.1. Background and Related Works

Although there has been a substantial research effort addressing vehicle tracking, designing accurate methods for the real world (especially for busy urban intersections) remains a significant challenge. Automated traffic analysis systems typically fall into two categories. The first involves tracking of whole objects identified by foreground/background separation [4] and vehicle detection [5, 6]. This approach is less robust to heavy congestion and occlusion effects. The second tracks local features on objects (vehicles) which are subsequently grouped together to form coherent trajectories of individual objects (e.g., [7, 8]). This results in additional computational complexity and sensitivity to noise, but feature-point-based methods handle occlusions more effectively and are more robust to varying lighting conditions and diverse vehicle motion patterns. [7] implement feature-based tracking by grouping features according to a common motion constraint (i.e., constant feature point separation over time as they move with little relative motion). N. Saunier et al. in [8] extend this by connecting (or disconnecting) Kanade-Lucas-Tomasi (KLT) feature point tracklets based on grouping and segmentation thresholds in space. These approaches, which attempt to capture common motion by simply thresholding the tracklet pairs' spatial and temporal relationships, often fail in dense urban traffic videos: proper thresholds need to be adaptive for different videos, or even for vehicles of different sizes in the same video, and failure to sufficiently adapt these parameters to changing conditions strongly affects the ultimate performance of the tracking. [9] recently combined both background subtraction and low-level feature identification by linking foreground "blobs" to form tracklets, matching low-level FREAK (Fast Retina Keypoint) [10] feature points within each blob. A Finite State Machine (FSM) was used to model different situations such as fragmentation, splitting, and merging of the blob tracklets to form the final tracks. The diversity of traffic videos results in complex FSM transitions, and the performance in different scenarios depends strongly on factors including the duration of "missing" an object and its fragmentation factor, both of which are difficult to optimize.

Figure 2. A typical scene from an NYCDOT camera. The street-level vantage point leads to numerous occlusions, lighting changes, and perspective effects. In addition, urban video is often congested with stop-and-go conditions.

Our system is based on the feature point tracking approach, but with a unique methodology designed for use on street-level traffic camera video. As we describe below, we first use a KLT tracker to track feature points and generate tracklets across video snippets. Each vehicle can have many feature points and thus many tracklets. We overcome complexities due to the presence of false detections (e.g., pedestrians, shadows, parked cars) by automatically clustering the tracklets into individual vehicles based on their spatio-temporal features as well as higher-level features derived from foreground/background separation. Instead of grouping by thresholding using multiple criteria, we integrate both low-level and high-level information into an adjacency matrix and use spectral clustering and Dirichlet Process Gaussian Mixture Models (DPGMM) to group and segment all tracklets automatically.

Figure 3. The KLT tracking result shows the effects of perspective as well as the stationary points which are to be filtered out as non-vehicle tracklets in subsequent steps. Feature points are shown in red and tracklets in green.

1.2. Challenges in Urban Traffic Video From Street-Level Cameras

Successful systems have been developed for the detection and classification of vehicles on highways, which are typically characterized by homogeneous and constant traffic flows with significant vehicle separations. In [6], Kim et al. used a 3D-model-based vehicle detection algorithm on the NGSIM dataset [22] to achieve 85% accuracy. However, these algorithms require accurate models for many different types of vehicles, and are difficult to apply to urban environments where vehicles have deformable appearances due to occlusions or viewing angles. Increased congestion, shadowing, occlusion, and pedestrians present significant challenges in urban traffic video. Furthermore, in the case of our NYCDOT cameras, the videos themselves have low resolution (480×640 pixels), frequent illumination changes among frames due to the camera's auto-iris feature, and vibration due to wind.

As shown in Figure 2, the cameras are mounted at heights of several meters (as opposed to the high bird's-eye view that is typical in highway datasets), leading to a high degree of partial occlusion in dense traffic. These dense traffic conditions also lead to stop-and-go conditions, complicating classical learned background techniques (e.g., Mixture of Gaussians, non-parametric models, etc.) as the stopped vehicles are hard to separate from the background. Shadows cast by the vehicles themselves also present complications, as they generate feature points that are difficult to distinguish from those on actual vehicles, and these shadows tend to connect adjacent vehicles.

Figure 4. Tracklets are shown before (left) and after (right) filtering and smoothing. Note that the filtering process removes the majority of stationary points on building corners and street paint.

2. Proposed System

Figure 1 illustrates the components of our methodology. The complete system integrates object-level and feature-level information into coherent vehicle trajectories that can then be used to develop surrogate safety measures for a given intersection.

2.1. KLT Feature Point Tracker

We first implement a standard Kanade-Lucas-Tomasi (KLT) tracker [11] to generate short tracklets that typically track corners on vehicles. We use the KLT tracker because it is fast and flexible, with many tunable parameters; we use the off-the-shelf implementation from the OpenCV library. As we show in Figure 3, a given vehicle can produce more than one tracklet and, given that KLT points come and go due to noise, low spatial resolution, poor lighting conditions, or occlusions, these tracklets need not overlap in time. That is, the KLT tracker may lose track of a feature point at some frame and then detect and track that point again in a future frame, so the tracklets corresponding to a given vehicle typically live in different time intervals and have short lengths. Thus, each tracklet $i$ consists of positions $\vec{x}_i(t_j)$ and velocities $\vec{v}_i(t_j)$ for the frames $t_j$ in which the KLT point is detected.
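As a minimal sketch of this step, the following uses OpenCV's off-the-shelf corner detection and pyramidal Lucas-Kanade tracking; the parameter values (corner count, quality level, minimum distance) are illustrative assumptions, not the settings used in the paper, and a production tracker would also drop lost points and periodically re-detect new corners.

```python
import cv2
import numpy as np

def klt_tracklets(frames, max_corners=500):
    """Detect corners in the first frame and track them frame to frame
    with pyramidal Lucas-Kanade; velocities are position differences."""
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    pts = cv2.goodFeaturesToTrack(prev, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=5)
    tracks = [[p.ravel()] for p in pts]
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        for track, p, ok in zip(tracks, nxt, status.ravel()):
            if ok:  # point successfully tracked into this frame
                track.append(p.ravel())
        pts, prev = nxt, gray
    # Each tracklet: (positions x_i(t_j), velocities v_i(t_j)).
    return [(np.array(t), np.diff(np.array(t), axis=0)) for t in tracks]
```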

Figure 5. The results of the rPCA foreground/background separation are shown for both street-level NYCDOT (top) and NGSIM (bottom) video. While the results effectively separate the moving vehicles from the constant background, in both cases over- and under-segmentation of vehicles is apparent.

2.2. Tracklet Filtering and Smoothing

The KLT tracker described above will also produce "spurious" tracklets for stationary points (e.g., corners on buildings or crosswalk stripes) or very short tracklets due to noise fluctuations. If we denote the maximum velocity of a tracklet by $v_{\max}$ and its total number of frames by $N_{\rm fr}$, we filter out these spurious detections by imposing minimum $v_{\max}$ and $N_{\rm fr}$ thresholds. By applying the threshold to $v_{\max}$ (as opposed to, for example, the average velocity magnitude $\langle|\vec{v}|\rangle$) we ensure that stopped cars which subsequently move are not removed from our tracklet sample.
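The filter itself is a one-liner; a sketch follows, where the frame rate and the two threshold values are assumptions for illustration rather than the paper's tuned settings.

```python
import numpy as np

def keep_tracklet(x, fps=5.0, min_vmax=2.0, min_nfr=10):
    """Filtering rule of Sec. 2.2: require a minimum number of frames
    N_fr and a minimum *maximum* speed v_max (thresholds illustrative).
    x: (N, 2) array of pixel positions over frames."""
    x = np.asarray(x)
    if len(x) < min_nfr:
        return False  # too short: likely a noise fluctuation
    speeds = np.linalg.norm(np.diff(x, axis=0), axis=1) * fps  # px / s
    # Thresholding the max (not mean) speed keeps cars that stop and later move.
    return speeds.max() >= min_vmax
```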

To reduce noise in the remaining tracklets due to camera shake, lighting changes, and pixelization, we smooth the trajectories both spatially and temporally. The KLT feature point locations have sub-pixel precision, and these locations also vary with camera shake. Spatial smoothing removes this small-scale flicker by fitting the $y(x)$ curve with piecewise-linear first-order splines in the $(x, y)$ plane. Temporal smoothing is performed on $\vec{x}(t)$ across frames with the LOWESS filter [12]. The smoothed and filtered results are shown in Figure 4.
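For the temporal step, a minimal sketch using the LOWESS implementation in statsmodels is shown below; the smoothing fraction is an illustrative choice, not a value reported in the paper.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def smooth_temporal(t, xy, frac=0.3):
    """LOWESS [12] smoothing of a tracklet's coordinates over time.
    t: (N,) frame indices; xy: (N, 2) positions; frac is illustrative."""
    t, xy = np.asarray(t), np.asarray(xy)
    xs = lowess(xy[:, 0], t, frac=frac, return_sorted=False)
    ys = lowess(xy[:, 1], t, frac=frac, return_sorted=False)
    return np.column_stack([xs, ys])
```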

2.3. Robust PCA Background Subtraction

Figure 6. Four corresponding point pairs are shown for the street-level (from the NYCDOT camera; top) and overhead (from Google Earth; bottom) perspectives of the same scene. Using this information we are able to construct the homography matrix and map street-level trajectories into parallel tracklets (see Figure 7).

Once tracklets are derived for a video snippet, we wish to group them according to the individual vehicles to which they belong. Simply thresholding the trajectory pairs' spatial and temporal relationships often fails in dense urban traffic video. For example, a tracklet representing the rear bumper of a leading vehicle may be quite close to one representing the front bumper of a trailing vehicle in $(\vec{x}, \vec{v})$ space. To overcome this, we augment a tracklet $i$ at frame $t_j$ with a label $b_i(t_j)$ indicating the overlapping moving foreground "blob" at frame $t_j$, derived from foreground/background separation of the video. Tracklets belonging to the same vehicle will ideally fall into the same foreground blob. Those which do not fall in any blob are assigned to the nearest blob (defined by distance to a blob edge). This blob label represents a high-level feature, as opposed to a local, low-level feature, and will be used to compare tracklets during clustering (see §2.5).
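A sketch of this labeling step is given below, assuming a binary foreground mask per frame from the rPCA step described next; the nearest-blob assignment uses a distance transform, which is one way (our assumption, not necessarily the paper's) to realize "distance to a blob edge".

```python
import numpy as np
from scipy import ndimage

def assign_blob_labels(fg_mask, points):
    """Label connected foreground blobs in one frame and assign each
    tracklet point (x, y) to its blob, or to the nearest blob if the
    point falls on background.  Returns (blob id, blob center) pairs."""
    labels, n_blobs = ndimage.label(fg_mask)
    # For every background pixel, indices of the nearest foreground pixel.
    _, (iy, ix) = ndimage.distance_transform_edt(labels == 0,
                                                 return_indices=True)
    nearest = labels[iy, ix]
    centers = ndimage.center_of_mass(fg_mask, labels, range(1, n_blobs + 1))
    out = []
    for (x, y) in points:               # note image indexing is [row, col]
        b = labels[int(y), int(x)] or nearest[int(y), int(x)]
        out.append((b, centers[b - 1]))
    return out
```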

To generate these foreground blobs we use Robust PCA (rPCA) [13], which decomposes the video data into a low-rank background and a sparse foreground and outperforms simple foreground/background separation methods, especially in the presence of frequent illumination changes. The rPCA is implemented via incremental Principal Component Pursuit [14, 15], which is computationally more efficient than the original rPCA method. Despite its efficacy, as shown in Figure 5, for congested urban scenes rPCA often forms either incomplete, unconnected blobs (multiple blobs for a single vehicle, which we mitigate via binary closing) or overconnected blobs (a single blob for multiple vehicles). Furthermore, as with other foreground/background separation techniques, stationary vehicles are grouped into the low-rank background. This over- and under-segmenting of blobs complicates blob label comparisons, so in the clustering step below we use the blob index to assign each tracklet $i$ a blob "position" defined as its center of mass $\vec{b}_{{\rm cen},i}$.
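For reference, the batch Principal Component Pursuit problem of [13] can be solved with the standard ADMM iteration sketched below, where the columns of D are vectorized frames; this is only an illustrative batch version, whereas the paper uses the more efficient incremental formulation of [14, 15].

```python
import numpy as np

def rpca_pcp(D, lam=None, max_iter=200, tol=1e-7):
    """Principal Component Pursuit: D ~ L (low-rank background)
    + S (sparse foreground), via the standard ADMM iteration."""
    m, n = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))      # canonical choice from [13]
    mu = m * n / (4.0 * np.abs(D).sum())
    Y = np.zeros_like(D)
    S = np.zeros_like(D)
    for _ in range(max_iter):
        # Singular value thresholding gives the low-rank update.
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0)) @ Vt
        # Elementwise soft thresholding gives the sparse update.
        R = D - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0)
        Y += mu * (D - L - S)
        if np.linalg.norm(D - L - S) <= tol * np.linalg.norm(D):
            break
    return L, S
```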

Figure 7. Unwarped (left) and warped (using the information from Figure 6; right) trajectories from the street-level NYCDOT cameras.

2.4. Perspective Transformation

The NYCDOT cameras are typically only several meters above ground level. The result is that vehicles have significantly different pixel sizes based on their location within a frame, and tracklets of different points on the same vehicle are not parallel in the image plane, as shown in Figure 3. Thus, tracklet clustering algorithms that depend on $\vec{x}$ and $\vec{v}$ in pixel units will perform poorly.

The alternative is to either use scene information to convert trajectories to world space or to perform perspective warping to parallelize the trajectories. Since we are not concerned with absolute velocities and positions, and since we do not know the intrinsic camera parameters of the NYCDOT cameras, we implement the latter approach by identifying four corresponding point pairs in an NYCDOT frame and a Google Earth map of the same intersection to calculate the homography matrix (see Figure 6). An example of the rectified trajectories is shown in Figure 7.
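With exactly four point pairs this is a direct OpenCV call, as sketched below; the coordinate values are placeholders for the hand-picked correspondences of Figure 6.

```python
import cv2
import numpy as np

# Four hand-picked corresponding points: camera frame -> overhead map.
src = np.float32([[412, 297], [583, 310], [254, 421], [633, 452]])  # illustrative
dst = np.float32([[100, 100], [300, 100], [100, 400], [300, 400]])  # illustrative
H = cv2.getPerspectiveTransform(src, dst)

def warp_trajectory(xy, H):
    """Map an (N, 2) pixel trajectory through the homography H."""
    pts = np.asarray(xy, dtype=np.float32).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)
```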

2.5. Grouping Tracklets Using Spectral Clustering

Spectral clustering has previously been used to analyze traffic at intersections by clustering vehicle trajectories in video. For example, [16, 17, 18] identify global patterns of an intersection by identifying clusters which represent particular characteristic trajectories; i.e., a cluster might represent a left or right turn, on-coming traffic, etc. Since their primary objective was to assess the global pattern (as opposed to vehicle-vehicle interactions, as is our goal), these works assume either that the individual trajectory information is available after manual rectification or that per-vehicle trajectories are not needed.

We group the $N_{\rm tr}$ tracklets into coherent vehicle trajectories via spectral clustering [19] as follows. For each pair of tracklets $i$ and $j$ that overlap temporally, we define an adjacency based on four features: the maximum positional separation along the tracklet, $x_{ij,\max} \equiv \max_k |\vec{x}_i(t_k) - \vec{x}_j(t_k)|$; the maximum velocity separations in the two dimensions along the tracklet, $v_{x,ij,\max} \equiv \max_k |v_{x,i}(t_k) - v_{x,j}(t_k)|$ and $v_{y,ij,\max} \equiv \max_k |v_{y,i}(t_k) - v_{y,j}(t_k)|$; and the maximum separation of the centers of mass of the blob labels, $b_{{\rm cen},ij,\max} \equiv \max_k |\vec{b}_{{\rm cen},i}(t_k) - \vec{b}_{{\rm cen},j}(t_k)|$. The adjacency $A_{ij}$ between two tracklets is defined as

$$\ln A_{ij} = -\vec{w} \cdot \vec{f}_{ij}, \qquad (1)$$

where $\vec{w}$ is a weight vector for the features in $\vec{f}_{ij} \equiv (x_{ij,\max},\, v_{x,ij,\max},\, v_{y,ij,\max},\, b_{{\rm cen},ij,\max})$.

The exponential kernel resulted in more accurate tracks in our case than the more commonly used Gaussian kernel, so the final result is based on the exponential kernel construction. The weight vector represents hyperparameters which must be tuned to give optimal results for a given video. Furthermore, we note that we use the maximum values of the position and velocity separations instead of the mean or median values: for "same-vehicle tracklet pairs" that correspond to the same rigid vehicle, the mean and maximum separations remain similar throughout the vehicle's trajectory, whereas for "different-vehicle tracklet pairs" the maximum can be very different, making it a stronger discriminant (for example, two vehicles traveling side-by-side for the majority of the video and then diverging towards the end).

We note that if $i$ and $j$ fall in the same blob for all $k$, then $b_{{\rm cen},ij,\max} = 0$. This center-of-mass representation is more robust than the blob index alone because even if a vehicle is over-segmented into multiple blobs, the blob centers are still relatively near each other. However, in the case of a large blob that connects multiple vehicles, the associated tracklets must rely on the other features to distinguish them. Lastly, for tracklets which do not overlap at any frame $t_k$, we set $A_{ij}$ to zero.
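A minimal sketch of Eq. (1) follows, assuming the two tracklets' positions, velocities, and blob centers have already been aligned to their common frames; the weights remain the tuned hyperparameters described above, and non-overlapping pairs are simply assigned $A_{ij} = 0$ outside this function.

```python
import numpy as np

def pair_features(xi, xj, vi, vj, bi, bj):
    """Features of Eq. (1) over the common frames of tracklets i and j:
    max position, x/y velocity, and blob-center separations."""
    fx  = np.linalg.norm(xi - xj, axis=1).max()
    fvx = np.abs(vi[:, 0] - vj[:, 0]).max()
    fvy = np.abs(vi[:, 1] - vj[:, 1]).max()
    fb  = np.linalg.norm(bi - bj, axis=1).max()
    return np.array([fx, fvx, fvy, fb])

def adjacency(fij, w):
    """Exponential kernel: ln A_ij = -w . f_ij."""
    return np.exp(-w @ fij)
```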

We use a modified version of the spectral clustering implementation in the SciKit-Learn (SKL) Python package [20] to individually cluster subcomponents of the full $N_{\rm tr} \times N_{\rm tr}$ adjacency matrix. The subcomponents are defined by thresholding $A_{ij}$ according to $x_{ij,\max} \leq d$ to form connected components which are never separated by more than a distance $d$. This drastically reduces the computation time since (while not truly sparse) $A_{ij}$ is dominated by very small adjacency values. For each connected component, the SKL spectral clustering implementation performs the spectral embedding and dimensionality reduction according to the $K$ lowest eigenvalues of the Laplacian matrix. In our modified implementation, rather than using the SKL default clustering methods (K-Means or discretization, both of which require a known number of clusters, which we do not have here) on this lower-dimensional manifold, we use a Dirichlet Process Gaussian Mixture Model (DPGMM) [21] to identify the number of clusters and label each tracklet.
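A sketch of these two steps with current library APIs is shown below. Note the assumptions: the paper used the 2016-era SKL DPGMM class, for which BayesianGaussianMixture with a Dirichlet-process prior is the modern equivalent, and the embedding dimension and cluster upper bound are illustrative values.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.manifold import spectral_embedding
from sklearn.mixture import BayesianGaussianMixture

def split_components(x_max, d):
    """Connected components of pairs with x_ij,max <= d, so clustering
    runs on small sub-blocks of A.  Non-overlapping pairs should carry
    x_max = inf so they are never connected."""
    _, comp = connected_components(csr_matrix((x_max <= d).astype(int)),
                                   directed=False)
    return comp

def cluster_component(A, n_dims=8, max_clusters=25):
    """Spectral embedding of one component's adjacency sub-matrix,
    then a DP-GMM that infers the number of vehicles on its own
    (max_clusters is only an upper bound on the truncation)."""
    emb = spectral_embedding(A, n_components=n_dims)
    dpgmm = BayesianGaussianMixture(
        n_components=max_clusters,
        weight_concentration_prior_type="dirichlet_process").fit(emb)
    return dpgmm.predict(emb)
```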

Lastly, it is clear that for long videos $N_{\rm tr}$ can be very large, making the spectral clustering of $A_{ij}$ computationally intractable. To get around this constraint, we segment videos into snippets of ∼600 frames (corresponding to roughly 2 minutes of NYCDOT video¹) and cluster within each snippet. The relatively few tracklets which overlap two snippets are propagated into the subsequent adjacency matrix, thus unifying labels across snippets. An example of the final clustered tracks is shown in Figure 8.

Figure 8. The initial groups from thresholding the adjacency matrix (top) and the clustered results for all KLT tracklets (middle) are shown for NGSIM (left) and NYCDOT (right) videos. The final extracted trajectories after spectral clustering and DPGMM are shown in the bottom panels.

3. Experimental Results and Evaluation

We evaluate our system performance on two different datasets: NYCDOT video recorded at Canal St & Baxter St in Manhattan's Chinatown, and the NGSIM US-101 highway dataset [22]. Both have dense traffic flows, but they differ in their camera orientation. The NYCDOT video is from a street-level camera with many occlusions, speed variations, and perspective effects, while the NGSIM video has a bird's-eye view with mostly parallel trajectories of uniform velocity.

Ground truth trajectories consisting of the vehicle center and bounding box for each frame were manually generated for each dataset with custom software (see Figure 9). To compare this ground truth with the automatically derived trajectories, we use two evaluation metrics proposed by [23]: the miss rate and the over-segmentation rate. The miss rate $M$ is defined as the ratio of the number of missed vehicles to the total number of moving vehicles, and the over-segmentation rate $O$ as the ratio of the number of vehicles represented by more than one trajectory to the total number of vehicles.

¹We derive KLT tracklets on raw 30 fps video for accurate tracking and then downsample by a factor of 6 prior to any subsequent steps to reduce computational complexity.

Figure 9. An example of manually generated ground truth trajectories is shown. Red dots represent the centroid of the bounding box (green) of the vehicle over time.

Figure 10. A schematic of our ground truth comparison. Trajectories are taken to be successfully captured if they fall within the ground truth bounding box for at least 30% of their temporal duration. (This is an over-segmentation case.)

For each ground truth trajectory, we compute the distance between this trajectory and all automatically generated trajectories. We only consider tracking successful in those cases for which the automatically generated trajectory is within the manually generated bounding box for over 30% of its temporal duration. When a ground truth vehicle has more than one automatically generated trajectory as a successful tracking, we consider this an over-segmentation case (see Figures 10 and 11).
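Once matches are counted per ground-truth vehicle, the two metrics of [23] reduce to simple ratios, as in the sketch below; the dictionary structure is our own bookkeeping convention, not something specified in the paper.

```python
def miss_and_overseg_rates(matches_per_vehicle):
    """Metrics of [23] from a dict {gt vehicle id: number of system
    trajectories matched to it}, where a "match" means the trajectory
    stays inside the GT bounding box for >= 30% of its duration."""
    n = len(matches_per_vehicle)
    miss = sum(1 for m in matches_per_vehicle.values() if m == 0) / n
    overseg = sum(1 for m in matches_per_vehicle.values() if m > 1) / n
    return miss, overseg
```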

We find that our tracker is quite successful on the street-level NYCDOT video, with a miss rate of $M = 21\%$ despite the substantial difficulties associated with that scene. A simple thresholding of the adjacency matrix yields a substantially higher miss rate of $M = 53\%$. For the NGSIM dataset our method also performs quite well, with a miss rate of $M = 18\%$, slightly worse than the $M = 15\%$ reported in [6], though we point out that the details of our tracker were designed for optimal performance on street-level video, not overhead video with nearly parallel trajectories of nearly uniform velocity.

Figure 11. A comparison of ground truth (green) and automatically generated (red) trajectories is shown for the NGSIM video. The top panels show examples of successful tracking, while the results in the bottom panels were unsuccessful: missed detection and over-segmentation.

For both datasets, over-segmentation can occur; specifically, we measure an over-segmentation rate of $O = 31\%$ for the NGSIM dataset. Over-segmented trajectories can be merged through post-processing or further clustering, which would substantially decrease the over-segmentation rate. It is also important to note that the ground truth trajectories are located at the centroids of the vehicles. Since the positions of our final trajectories are defined as the median of the clustered tracklets at each frame, they may deviate from the vehicle centroids due to incomplete KLT feature point coverage of the vehicle or missed feature points. The final clustered result can also be smoothed through post-processing, which we leave as future work.

4. Conclusion and Future Work

In this paper we have developed an automatic system for vehicle tracking designed for street-level video in urban environments. Our method uses a combination of high-level and low-level vehicle features to group feature point tracklets into coherent trajectories via spectral clustering. We have shown that this system can also be effectively applied to bird's-eye view highway traffic video. Given the robustness of our method, even in very challenging video with multiple occlusions, variable velocity, and a complex background scene, the derived trajectories can be used to develop surrogate safety metrics for post facto traffic analysis.


Acknowledgment

This work was funded through a sponsored research grant from the American International Group. Dr. Gregory Dobler was partially supported by a Complex Systems Scholar grant from the James S. McDonnell Foundation.

References

[1] 2014 Traffic Safety Facts, National Highway Traffic Safety Administration, 2016. http://www-nrd.nhtsa.dot.gov/Pubs/812261.pdf

[2] Hauer, Ezra. "Traffic conflicts and exposure." Accident Analysis & Prevention 14.5 (1982): 359-364.

[3] Gettman, Douglas, et al. Surrogate Safety Assessment Model and Validation: Final Report. FHWA-HRT-08-051. 2008.

[4] D. Bloisi and L. Iocchi. "Argos: a video surveillance system for boat traffic monitoring in Venice." IJPRAI, 2009, pp. 1477-1502.

[5] Li, Jinling, et al. "Automated Region-Based Vehicle Conflict Detection Using Computer Vision Techniques." Transportation Research Record: Journal of the Transportation Research Board 2528 (2015): 49-59.

[6] Kim, ZuWhan, and Jitendra Malik. "High-quality vehicle trajectory generation from video data based on vehicle detection and description." Intelligent Transportation Systems, 2003. Proceedings. 2003 IEEE. Vol. 1. IEEE, 2003.

[7] Coifman, Benjamin, et al. "A real-time computer vision system for vehicle tracking and traffic surveillance." Transportation Research Part C: Emerging Technologies 6.4 (1998): 271-288.

[8] Saunier, Nicolas, and Tarek Sayed. "A feature-based tracking algorithm for vehicles in intersections." Computer and Robot Vision, 2006. The 3rd Canadian Conference on. IEEE, 2006.

[9] Jodoin, Jean-Philippe, Guillaume-Alexandre Bilodeau, and Nicolas Saunier. "Tracking All Road Users at Multimodal Urban Traffic Intersections."

[10] Alahi, Alexandre, Raphael Ortiz, and Pierre Vandergheynst. "FREAK: Fast retina keypoint." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.

[11] Shi, Jianbo, and Carlo Tomasi. "Good features to track." Computer Vision and Pattern Recognition, 1994. Proceedings CVPR '94., 1994 IEEE Computer Society Conference on. IEEE, 1994.

[12] Cleveland, William S., and Susan J. Devlin. "Locally weighted regression: an approach to regression analysis by local fitting." Journal of the American Statistical Association 83.403 (1988): 596-610.

[13] Candès, Emmanuel J., et al. "Robust principal component analysis?" Journal of the ACM (JACM) 58.3 (2011): 11.

[14] Rodríguez, Paul. "Real-time incremental principal component pursuit for video background modeling on the TK1." (2015).

[15] Rodriguez, Paul, and Brendt Wohlberg. "A Matlab implementation of a fast incremental principal component pursuit algorithm for video background modeling." Image Processing (ICIP), 2014 IEEE International Conference on. IEEE, 2014.

[16] Fu, Zhouyu, Weiming Hu, and Tieniu Tan. "Similarity based vehicle trajectory clustering and anomaly detection." Image Processing, 2005. ICIP 2005. IEEE International Conference on. Vol. 2. IEEE, 2005.

[17] Atev, Stefan, Osama Masoud, and Nikolaos Papanikolopoulos. "Learning Traffic Patterns at Intersections by Spectral Clustering of Motion Trajectories." IROS. 2006.

[18] Xin, Le, et al. "Traffic flow characteristic analysis at intersections from multi-layer spectral clustering of motion patterns using raw vehicle trajectory." 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, 2011.

[19] Ng, Andrew Y., Michael I. Jordan, and Yair Weiss. "On spectral clustering: Analysis and an algorithm." Advances in Neural Information Processing Systems 2 (2002): 849-856.

[20] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." The Journal of Machine Learning Research 12 (2011): 2825-2830.

[21] Rasmussen, Carl Edward. "The Infinite Gaussian Mixture Model." NIPS. Vol. 12. 1999.

[22] NGSIM dataset: http://www.ngsim.fhwa.dot.gov/

[23] Saunier, Nicolas, Tarek Sayed, and Karim Ismail. "An object assignment algorithm for tracking performance evaluation." 11th IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS 2009), Miami, FL, USA, June 2009. Vol. 25.