
Dynamics of Driver's Gaze: Explorations in Behavior Modeling & Maneuver Prediction

Sujitha Martin, Member, IEEE, Sourabh Vora, Kevan Yuen, and Mohan M. Trivedi, Fellow, IEEE

Abstract—The study and modeling of the driver's gaze dynamics is important because if and how the driver is monitoring the driving environment is vital for driver assistance in manual mode, for take-over requests in highly automated mode, and for semantic perception of the surround in fully autonomous mode. We developed a machine vision based framework to classify the driver's gaze into context-rich zones of interest and to model the driver's gaze behavior by representing gaze dynamics over a time period using gaze accumulation, glance duration, and glance frequency. As a use case, we explore the driver's gaze dynamic patterns during maneuvers executed in freeway driving, namely left lane change, right lane change, and lane keeping. It is shown that condensing gaze dynamics into durations and frequencies leads to recurring patterns based on driver activities. Furthermore, modeling these patterns shows predictive power in maneuver detection up to a few hundred milliseconds a priori.

Index Terms—Autonomous Driving, Naturalistic Driving Study, Control Transitions, Attention and Vigilance Metrics, Driver State and Intent Recognition

I. INTRODUCTION

Intelligent vehicles of the future are those which, having a holistic perception (i.e., inside, outside, and of the vehicle) and an understanding of the driving environment, make it possible for occupants to go from point A to point B safely, comfortably, and in a timely manner [1], [2]. This may happen with the driver in full control and getting active assistance from the robot, or with the robot in partial or full control and the human driver a passive observer "ready" to take over as deemed necessary by the machine or human [3], [4]. In the full spectrum from manual to autonomous mode, modeling the dynamics of the driver's gaze is of particular interest because if and how the driver is monitoring the driving environment is vital for driver assistance in manual mode [5], for take-over requests in highly automated mode [6], and for semantic perception of the surround in fully autonomous mode [7], [8].

The driver's gaze can be represented in many different ways, from directional vectors [9] to points in 3-D space [10], and from static zones of interest (e.g., speedometer, side mirrors) [11] to dynamic objects of interest (e.g., vehicles, pedestrians) [12]. In this paper, gaze is represented using context-rich static zones of interest. Using such a representation, when gaze is estimated over a period of time, higher semantic information such as fixations and saccades can be extracted and used to derive the driver's situational awareness, estimate engagement in secondary activities, predict intended maneuvers, etc. Figure 1 illustrates an example where the length of the driver's fixation on a non-driving-relevant region is an important factor in determining the driver's state and, therefore, when the transfer of control should happen.

The authors are with the Laboratory for Intelligent and Safe Automobiles, University of California San Diego, La Jolla, CA 92093 USA (see http://cvrr.ucsd.edu/).

Figure 1. An example illustration showcasing the importance of understanding and modeling what constitutes expected or "attentive" gaze behavior in order to ensure safe and smooth transfer of control between robot and human.

In general, with the rapid introduction of autonomous features in consumer vehicles, there is a need to understand and model what constitutes "normal" or "attentive" gaze behavior in order to ensure safe and smooth transfer of control between robot and human. An important ingredient for building such models is naturalistic driving data, where the driver is in full control. Using such data, we propose to build expected gaze behavior models for a given situation or activity and attempt to predict the presence or absence of such behavior on data unseen during training. For example, if we build a gaze model from left lane change events alone, then applying the model to new unlabeled data gives the likelihood of a left lane change event occurring or, more abstractly, of the driver's situational awareness necessary to make a left lane change. One of the challenges is in the mapping from spatio-temporally rich gaze dynamics to activities or events of interest. For example, when engaged in a secondary task which uses the center stack (e.g., radio, AC, navigation), the manner in which the driver looks at the center stack can vary greatly. One driver may perform the secondary task with one long glance away from the forward driving direction and at the center stack.


At another time, the driver may perform the secondary task via multiple short glances towards the center stack, and so on. However, while individual gaze dynamic patterns differ, together they are associated with an activity of interest.

To this end, we present a machine vision based framework focused on gaze-dynamics modeling and behavior prediction using naturalistic driving data. Whereas a preliminary study of this work using a single driver is presented in [13], the contributions of this paper are as follows:

• An extensive overview of related studies on gaze estimation and on higher semantics derived from gaze.

• A naturalistic driving dataset composed of multiple drivers, annotated with ground-truth gaze zones and maneuver executions (e.g., lane change, lane keeping).

• New metrics to quantitatively evaluate the performance of gaze estimation over a time period, as opposed to at the individual frame level (i.e., metrics to evaluate gaze accumulation, which is computed over a time segment).

• A formal definition and nomenclature of gaze-dynamic features (i.e., gaze accumulation, glance duration, glance frequency), and a comparison of the effect of using different combinations of these features on behavior prediction accuracy.

II. RELATED STUDIES

The work presented in this paper has three major components: gaze estimation, gaze behavior modeling and prediction, and performance evaluation. Table I reflects these attributes by dividing its columns into two major sections, gaze estimation and higher semantics with gaze. Some works present automatic gaze estimation frameworks but do not proceed further, while others use manually annotated gaze data to study higher semantics. In the two studies which present work in both categories, the differences are subtle but significant: first in the number of gaze zones, second in the features used for behavior modeling and prediction, and third in the performance evaluation of gaze zones and behavior prediction. Note that the studies presented in Table I are selected based on two criteria: first, a study must present work on the driver's gaze and, second, its evaluation must be conducted at some level with on-road driving data.

A. On Gaze Estimation

In the literature, works on gaze zone estimation are relatively new, and they fall into two categories: geometric and learning based methods. The work presented in [14] estimates gaze zones based on geometric methods, where a 3-D model car is divided into different zones and 3-D gaze tracking is used to classify gaze into predefined zones; however, no evaluation at the gaze zone level is given. Another geometric based method is presented in [9], but the number of gaze zones estimated is very limited (i.e., on-road versus off-road) and the evaluations are conducted in stationary vehicles. In terms of learning based methods, there are two prevalent works. The work by Tawari et al. [11] is most similar to the work presented in this paper in terms of the features selected (e.g., head pose, horizontal gaze surrogate), the classifier used (i.e., random forest), and the evaluation on naturalistic driving data. The difference is that this work introduces another feature to augment the state of the eyes (i.e., an appearance descriptor), which allows for an increased number of gaze zones without sacrificing performance, as shown by evaluating on a dataset composed of multiple drivers. Another learning based method is the work presented by Fridman et al. [15], where the evaluations are done on a significantly larger dataset, but the design of the features representing the state of the head and eyes causes their classifier to overfit to user based models and to generalize poorly as a global model.

B. On Gaze Behavior

In terms of gaze modeling and behavior understanding, most studies have been conducted in driving simulators, but a few recent works using on-road driving have emerged. In one on-road study, Birrell and Fowkes [16] explore the effects of using an in-vehicle smart driving aid on glance behavior. The study uses glance durations and glance transition frequencies to show differences in glance behavior between baseline, normal driving and driving while using in-vehicle devices. Similarly, through manual annotations of glance times and targets, Munoz et al. [17] experimentally analyzed glance allocation strategies under three different situations: manual radio tuning, voice-based radio tuning, and normal driving. In another on-road study, Li et al. [18] show that drivers exhibit different gaze behaviors when engaged in secondary tasks versus baseline, normal driving, using mirror-checking actions as indicators for differentiating between the two. In addition to gaze related features, Li et al. also employed features from the CAN bus and the road camera when training to detect mirror-checking actions, which raises the question of whether the system is actually learning what the driver should be doing rather than what the driver is doing. While most gaze behavior studies have largely centered on detecting the driver's state from the driver's glance allocation strategy, [19] goes beyond this to ask whether the external driving environment can be inferred from six seconds of driver glances.

In Table I, the gaze behavior related literature is divided into two categories based on whether gaze estimation was performed automatically or manually. Such a distinction is presented in order to acknowledge works that have taken into consideration the noise in gaze estimates when modeling or predicting driver behavior from gaze. Our work explicitly addresses the effects of noisy gaze estimates on gaze behavior modeling by quantitatively evaluating the gaze dynamic features (see Section IV-B), as indicated by column four in Table I.

III. FROM GAZE ESTIMATION TO DYNAMICS TO BEHAVIOR MODELING

In this section, the methods related to vision based gaze estimation, spatio-temporally rich gaze dynamics descriptors, and behavior prediction from gaze modeling are described.

A. Gaze Estimation

Gaze estimation is an important first step towards building gaze behavior models.


Table I
SELECTED STUDIES ON VISION BASED GAZE ESTIMATION AND HIGHER SEMANTICS WITH GAZE WHICH ARE EVALUATED ON SOME LEVEL WITH ON-ROAD DRIVING DATA.
(Columns 3-6 fall under "Gaze Estimation"; columns 7-10 fall under "Higher Semantics with Gaze".)

Research Study | Objective / Motivation | Methodology | Num. of Gaze Zones | Evaluation Over Continuous Time | Accuracy | Features | Method | Behavior / Task / State of Interest | Prediction Accuracy
Tawari et al., 2014 [12] | Estimating driver attention by simultaneous analysis of viewer and view | Geometric | Function of salient objects | No | 46% and 79% with manual and automatic detection, respectively, of salient objects | Not applicable | Not applicable | Not applicable | Not applicable
Tawari, Chen & Trivedi, 2014 [11] | Estimate driver's coarse gaze direction using both head and eye cues | Learning | 6 | No | 80% with head pose alone and 95% with head plus eye cues | Not applicable | Not applicable | Not applicable | Not applicable
Vicente et al., 2015 [9] | Detecting eyes off the road (EOR) | Geometric | 2 | No | 90% EOR accuracy | Not applicable | Not applicable | Not applicable | Not applicable
Vasli, Martin & Trivedi, 2016 [20] | Exploring the fusion of geometric and data driven approaches to driver gaze estimation | Geometric plus learning | 3 | No | 75% with geometric and 94% with geometric plus learning | Not applicable | Not applicable | Not applicable | Not applicable
Fridman et al., 2016 [15] | Exploring the effects of head pose and eye pose on gaze | Learning | 6 | No | 89% with head pose alone and 95% with head and eye pose | Not applicable | Not applicable | Not applicable | Not applicable
Birrell & Fowkes, 2014 [16] | Investigates glance behaviors of drivers when using a smartphone application | Manual annotation | 8 | No | Not applicable | GL, GD, GF | Not applicable | Secondary task versus baseline driving | Not applicable
Munoz et al., 2016 [17] | Predicting tasks based on distinguishing patterns in drivers' visual attention allocation | Manual annotation | 11 | No | Not applicable | GL, GD | HMM | Secondary tasks | Min of 68% to max of 96%
Fridman et al., 2016 [19] | Exploring what broad macro eye-movements reveal about the state of the driver and the driving environment | Manual annotation | 8 | No | Not applicable | GD, GTF | HMM | Driving environment, driver behavior/state, driver demographic characteristic | Min of 52% to max of 88%
Ahlstrom, Kircher & Kircher, 2013 [14] | Investigate the usefulness of a real-time distraction detection algorithm called AttenD | Geometric | Not available | No | Not available | GL, GD | Rule based | Attention to field relevant to driving | Not available
Li & Busso, 2016 [18] | Detecting mirror-checking actions and its application to maneuver and secondary task recognition | Learning | 2 | No | Using all features from CAN, road cam and face cam: 90% weighted and 96% unweighted accuracy | GL, GD, GF, CAN-Bus signal, road dynamics | LDC | Vehicle maneuvers, secondary tasks | Min of 58% to max of 76%
This work | Estimating gaze dynamics and investigating the predictive power of glance duration and frequency on driver behavior | Learning | 9 | Yes | 84% weighted accuracy and mostly above 25% in ratio of estimated to true gaze accumulation | GL, GA, GD, GF | MVN | Left/right lane changes, lane keeping | Min of 78% and max of 84%

GA = gaze accumulation, GL = glance location, GD = glance duration, GF = glance frequency, GTF = glance transition frequency, HMM = Hidden Markov Model, LDC = Linear Discriminant Classifier, MVN = Multivariate Normal


As the emphasis of this work is on gaze behavior understanding, modeling, and prediction, this work does not seek to claim a major contribution in the domain of gaze estimation. However, for the sake of self-containment, this section provides high-level information on the modules making up the gaze estimator, with references for further details. The key modules in this vision based gaze estimation framework, as illustrated in Figure 2, are as follows:

• Perspective Selection: A part-hardware, part-software solution using a distributed multi-perspective camera system, where each perspective is treated independently and a perspective is selected based on the dynamics and quality of the head pose; details on head pose estimation are given below. Such a system is necessary to continuously and reliably track the head pose of the driver during large head movements [21].

• Face Detection: A deep CNN based system (with AlexNet as the base network) is trained on heavily augmented face datasets to include more examples of faces under harsh lighting and occlusion [22].

• Facial landmark estimation: The landmarks are estimated using a cascade of regression models as described in [24], [23], with more details on iris localization given in [11].

• Head pose estimation: A geometric method where local features, such as the eye corners, nose corners, and nose tip, and their relative 3-D configurations determine the pose [21].

• Horizontal gaze surrogate: The horizontal gaze direction β with respect to the head (see Figure 2) is estimated as a function of α, the angle subtended by an eye in the horizontal direction, the head-pose (yaw) angle θ with respect to the image plane, and d_1/d_2, the ratio of the distances of the iris center from the detected corners of the eye in the image plane [11].

• Appearance descriptor: The appearance of the eye is represented by computing HOG (Histogram of Oriented Gradients) features [25] in a 2-by-2 patch around the eye. This descriptor is especially designed to capture the vertical gaze of the eyes.

• Gaze zone estimation: The eight semantic gaze zones of interest are far left, left, front, speedometer, rearview, center stack, front right, and right, as illustrated in Figure 2. Another class of interest, not illustrated in the figure, is the eyes-closed state. Consider a set of feature vectors F = {f_1, f_2, ..., f_N} and their corresponding class labels X = {x_1, x_2, ..., x_N} for N sample instances. Here, a feature vector is a concatenation of the head pose and eye cues described above, and each class label is one of the nine gaze zones. Given F and X, a random forest (RF) classifier is trained on this corpus; a minimal sketch of this step is given below.
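As a concrete illustration of this last step, the following is a minimal sketch of training such a random forest on pre-extracted per-frame features; the feature dimensions, the toy data, the make_feature_vector helper, and the forest hyperparameters are placeholder assumptions, not the exact configuration used in this work.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

GAZE_ZONES = ["far left", "left", "front", "speedometer", "rearview",
              "center stack", "front right", "right", "eyes closed"]

def make_feature_vector(head_pose, gaze_surrogate, eye_hog):
    """Concatenate head pose (e.g., yaw, pitch, roll), the horizontal gaze
    surrogate, and the HOG appearance descriptor of the eye region."""
    return np.concatenate([head_pose, [gaze_surrogate], eye_hog])

def train_gaze_zone_classifier(F, X, n_trees=100):
    """F: (N, D) matrix of per-frame feature vectors, X: (N,) zone labels."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf.fit(F, X)
    return rf

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F = rng.normal(size=(500, 40))               # 40-D placeholder feature vectors
    X = rng.integers(0, len(GAZE_ZONES), 500)    # integer zone labels
    rf = train_gaze_zone_classifier(F, X)
    print(GAZE_ZONES[rf.predict(F[:1])[0]])      # predicted zone for one frame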

B. Spatio-Temporal Feature Descriptor

The gaze estimator, as described in the previous section, outputs where the driver is looking at a given instant. A continuous segment of gaze estimates of where the driver has been looking is referred to as a scanpath. Figure 3 illustrates multiple scanpaths in a 10-second time window around a lane change: two scanpaths from left lane change events and two from right lane change events. In the figure, the x-axis represents time and the color displayed at a given time t represents the estimated gaze zone. Let SyncF denote the time when the tire touches the lane marking before crossing into the next lane, which is the "0 seconds" displayed in the figure. Visually, in the 5-second time period before SyncF, there is some consistency observed across the different scanpaths within a given event (e.g., left lane change), such as the minimum glance duration in relevant gaze zones. For example, in the scanpaths associated with a right lane change, the driver glances at the rearview and right gaze zones for a significant duration. However, the start and end points of the glances are not necessarily the same across the different scanpaths. Therefore, we represent the scanpaths using features called gaze accumulation, glance frequency, and glance duration, which remove some temporal dependencies but still capture sufficient spatio-temporal information to distinguish between different gaze behaviors.

Figure 2. An illustrative block diagram showing the process of estimating the gaze zone, from the time of capture from multiple camera perspectives to classifying gaze into one of nine gaze zones (i.e., the eight gaze zones illustrated above and "eyes closed"). Key modules in the system include deep CNN based face detection [22], landmark estimation [23], the horizontal gaze surrogate [11], the appearance descriptor, and the head pose based perspective selector [21].

Figure 3. Four different scanpaths during a 10-second time window prior to a lane change, two scanpaths during left lane change events and two during right lane change events, with sample face images from various gaze zones. Consistencies such as the total glance duration and the number of glances to regions of interest within a time window are useful to note when describing the nature of the driver's gaze behavior; such consistencies can be used as features to predict behaviors. See Figure 2 for a legend of which color is associated with which gaze zone. Panels: (a) two left lane change events; (b) two right lane change events.

As these features are computed over a time window, we first define the signals necessary to compute them. Let Z represent the set of all nine gaze zones, Z = {Front, Right, Left, Center Stack, Rearview, Speedometer, Left Shoulder, Right Windshield, Eyes Closed}, and let L = |Z| be the total number of gaze zones. Let the vector G = [g_1, g_2, ..., g_N] represent the estimated gaze over an arbitrary time period T, where N = fps (frames per second) × T, g_n ∈ Z, and n ∈ {1, 2, ..., N}. The following describes how to compute gaze accumulation, glance duration, and glance frequency given G.

1) Gaze Accumulation: Gaze accumulation is a vector of size L, where each entry is a function of a unique gaze zone. Given a gaze zone, gaze accumulation is the accumulated count of frames in which the driver looked at that zone within the time period, normalized by the window length to yield a relative accumulation. Mathematically, the gaze accumulation for gaze zone z_j, where j ∈ {1, 2, ..., L} indexes the jth gaze zone in Z, is

\text{Gaze Accumulation}(z_j) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}(g_n = z_j)

where \mathbb{1}(\cdot) is the indicator function.

2) Glance Frequency: Glance frequency is a vector of size L, where each entry is a function of a unique gaze zone. Within the time period T, every time there is a transition from one gaze zone to another (e.g., Front to Speedometer), the glance count for the destination gaze zone is incremented; the count is then normalized by the time period to produce the glance frequency. Under the condition that the estimates are noise free, the glance frequency for gaze zone z_j, where j ∈ {1, 2, ..., L} indexes the jth gaze zone in Z, is

\text{Glance Frequency}(z_j) = \frac{1}{N} \sum_{n=2}^{N} \mathbb{1}(g_n = z_j) \times \mathbb{1}(g_{n-1} \neq z_j)

However, since gaze estimates are noisy, a majority rule over a buffered window is necessary to acknowledge a transition into a new gaze zone. Algorithm 1 details the calculation of the glance frequency while accounting for noisy estimation.

3) Glance Duration: Glance duration is a vector of size L, where each entry is a function of a unique gaze zone. Given a gaze zone, glance duration is the longest glance made towards that gaze zone within the time window of N frames. Following the same process as in Algorithm 1, in addition to counting when transitions to new gaze zones occur, the start and end of each continuous glance can also be tracked. For gaze zone z_j, let S_{z_j} be an [N_j × 2] matrix holding the start and end indices of the N_j = CG(z_j) continuous glances, where CG(z_j) is the number of continuous glances to z_j as computed in Algorithm 1. The glance duration for gaze zone z_j, where j ∈ {1, 2, ..., L} indexes the jth gaze zone in Z, is

\text{Glance Duration}(z_j) = \begin{cases} \max_{1 \le n \le N_j} \left| \delta\!\left(S_{z_j}(n, :)\right) \right| & \text{if } N_j > 0 \\ 0 & \text{if } N_j = 0 \end{cases}

where \delta(\cdot) is the difference operator.

The final feature vector h representing a scanpath is then made up of a combination of the descriptors described above. In particular, this paper explores the benefits of representing a scanpath in three different ways: gaze accumulation alone, glance duration alone, and glance duration plus glance frequency. A minimal code sketch of these descriptors is given below.
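To make the three descriptors concrete, here is a minimal Python sketch under the noise-free assumption (i.e., without the majority-rule smoothing of Algorithm 1, described next); the zone list, frame rate, and toy gaze sequence are placeholder assumptions.

import numpy as np

ZONES = ["Front", "Right", "Left", "Center Stack", "Rearview",
         "Speedometer", "Left Shoulder", "Right Windshield", "Eyes Closed"]

def scanpath_descriptors(G, fps=30.0):
    """G is a list of per-frame gaze-zone labels over one time window.
    Returns (gaze accumulation, glance frequency, glance duration), each with
    one entry per zone, assuming the per-frame estimates are noise free."""
    N = len(G)
    T = N / fps
    acc = np.zeros(len(ZONES))   # fraction of frames spent in each zone
    freq = np.zeros(len(ZONES))  # glances per second into each zone (as in Algorithm 1, CG / T)
    dur = np.zeros(len(ZONES))   # longest continuous glance per zone, in seconds

    run_start = 0
    for n in range(N):
        j = ZONES.index(G[n])
        acc[j] += 1.0 / N
        if n > 0 and G[n] != G[n - 1]:           # transition into zone j
            freq[j] += 1.0 / T
            run_start = n
        dur[j] = max(dur[j], (n - run_start + 1) / fps)
    return acc, freq, dur

# Example: a toy 5-second window sampled at 30 fps.
G = ["Front"] * 90 + ["Rearview"] * 30 + ["Right"] * 30
print(scanpath_descriptors(G, fps=30.0))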

C. Gaze Behavior Modeling

Consider a set of feature vectors H = {h_1, h_2, ..., h_N} and their corresponding class labels Y = {y_1, y_2, ..., y_N}. In this paper, the class labels are the maneuvers: Left Lane Change, Right Lane Change, and Lane Keeping.


Algorithm 1: Compute a vector of glance frequencies given noisy estimates of gaze zones over a time period T with N frames.

  input : G = [g_1, g_2, ..., g_N], noisy gaze estimates;
          W, a positive time-window threshold for the consistency check, with W < N
  output: A vector FG of glance frequencies

  LastGazeState ← g_1
  for i ← W to N do
      if g_i ≠ LastGazeState then
          if Majority(g_i == [g_{i-1}, ..., g_{i-W}]) then
              CG(g_i) ← CG(g_i) + 1
              LastGazeState ← g_i
          end
      end
  end
  FG ← (1/T) × CG
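For reference, a small Python sketch of this majority-rule glance counting (mirroring Algorithm 1) might look as follows; the window threshold W, the frame rate, and the toy sequence are placeholder assumptions.

from collections import Counter

def glance_frequencies(G, W=5, fps=30.0):
    """Count glances per zone from noisy per-frame gaze estimates G, accepting a
    transition into zone g_i only if g_i wins a majority vote over the previous
    W frames, then normalize the counts by the window length T."""
    N = len(G)
    T = N / fps
    CG = Counter()                         # continuous-glance counts per zone
    last_state = G[0]
    for i in range(W, N):
        if G[i] != last_state:
            window = G[i - W:i]            # the W frames preceding frame i
            if window.count(G[i]) > W / 2:  # majority check
                CG[G[i]] += 1
                last_state = G[i]
            # else: treat the disagreement as noise and keep last_state
    return {zone: count / T for zone, count in CG.items()}

# Example: a single-frame blip to Speedometer is filtered out as noise.
G = ["Front"] * 60 + ["Speedometer"] + ["Front"] * 29 + ["Rearview"] * 45 + ["Front"] * 45
print(glance_frequencies(G, W=5))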

The gaze behaviors of the respective events, tasks, or maneuvers are then modeled using an unnormalized multivariate normal distribution (MVN):

M_b(h) = \exp\left(-\frac{1}{2}(h - \mu_b)^T \Sigma_b^{-1} (h - \mu_b)\right)

where b ∈ B = {Left Lane Change, Right Lane Change, Lane Keeping}, and \mu_b and \Sigma_b are the mean and covariance computed over the training feature vectors for the gaze behavior represented by b. One of the reasons for modeling gaze behavior in this way is that, given a new test scanpath descriptor h_test, we want to know how it compares to the average scanpath computed for each gaze behavior in the training corpus. One possibility is to compute the Euclidean distance between the average scanpath descriptor \mu_b and the test scanpath descriptor h_test for all b ∈ B, and assign the label with the shortest distance. However, this assigns equal weight, or penalty, to every component of h. The weights should instead be a function of both the component and the behavior under consideration. Therefore, we use the Mahalanobis distance, which weights components appropriately based on the expected variance in the training data. Furthermore, by exponentiating the negative half of the squared Mahalanobis distance to produce the unnormalized MVN, the range is mapped between 0 and 1. To a degree, this can be used to assess the probability, or confidence, that a test scanpath represented by its descriptor h_test belongs to a particular gaze behavior model. A small numpy sketch of this modeling and scoring step is given below.
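The following is a minimal numpy sketch of this modeling and scoring step, assuming scanpath descriptors have already been computed; the small ridge term added to the covariance and the toy data are assumptions made for numerical stability and illustration, not details specified in the paper.

import numpy as np

def fit_behavior_model(H):
    """H: (num_samples, D) matrix of scanpath descriptors for one behavior.
    Returns the mean vector and (lightly regularized) covariance."""
    mu = H.mean(axis=0)
    sigma = np.cov(H, rowvar=False) + 1e-6 * np.eye(H.shape[1])  # assumed ridge term
    return mu, sigma

def behavior_score(h, mu, sigma):
    """Unnormalized MVN fitness: exp(-0.5 * squared Mahalanobis distance)."""
    d = h - mu
    m2 = d @ np.linalg.solve(sigma, d)
    return np.exp(-0.5 * m2)

def classify_scanpath(h, models):
    """models: dict mapping behavior name -> (mu, sigma). Returns the behavior
    whose model gives the highest fitness score for descriptor h."""
    return max(models, key=lambda b: behavior_score(h, *models[b]))

# Example with toy descriptors (D = 18, e.g., glance duration + frequency).
rng = np.random.default_rng(0)
models = {b: fit_behavior_model(rng.normal(loc=i, size=(40, 18)))
          for i, b in enumerate(["Left Lane Change", "Right Lane Change", "Lane Keeping"])}
print(classify_scanpath(rng.normal(loc=1, size=18), models))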

IV. EXPERIMENTAL DESIGN AND ANALYSIS

A. Naturalistic Driving Dataset

A large corpus of naturalistic driving data was collected using an instrumented vehicle testbed. The vehicle testbed is instrumented to synchronously capture data from camera sensors looking in and looking out, radars, LIDARs, GPS, and the CAN bus. Of interest in this study are two camera sensors looking at the driver (i.e., one near the rearview mirror and another near the A-pillar) and one camera sensor looking out in the driving direction.

Table II
DESCRIPTION OF ANALYZED ON-ROAD DRIVING DATA.

Driver ID | Full Drive Duration [min] | Left Lane Change Events | Right Lane Change Events | Lane Keeping Events
1 | 52.10 | 9 | 5 | 20
2 | 24.25 | 5 | 5 | 60
3 | 28.13 | 5 | 4 | 50
4 | 36.38 | 10 | 4 | 32
5 | 39.20 | 10 | 4 | 45
6 | 27.49 | 6 | 5 | 80
7 | 37.50 | 5 | 5 | 46
All | 273.30 | 50 | 32 | 333

As the focus of this study is on driver gaze dynamics, the looking-out view is only used to provide context for data mining. Using the same instrumented vehicle testbed, seven drivers of varying driving experience drove the car on different routes for an average of 40 minutes (see Table II). Each drive consisted of some parts in urban settings, but mostly freeway settings with multiple lanes. As the drivers were familiar with the area and were given the independence to design their own routes, the dataset contains natural glance behavior during driving maneuvers.

From the collected dataset of seven drivers, several types of annotations were performed:

• Gaze zone annotation of an approximately equal number of samples per gaze zone for all seven drivers. Each sample was annotated only when the annotator was highly confident that the sample unambiguously falls into one of the nine gaze-zone classes. Some annotated samples are from consecutive video frames while others are not. For a full description, the reader is referred to [26]. Let's call this the Gaze-zone-dataset.

• Left and right lane change event annotations for all drivers. As a point of synchronization for lane change events, the moment when the vehicle tire is about to cross over into the other lane is marked during annotation and denoted SyncF. A 20-second window centered on SyncF makes up the event. At the time of training and testing, however, gaze dynamics is computed on a sliding 5-second window (see Section IV-C). The accumulated number of these events per driver and overall is given in Table II. Let's call this the Lane-change-events-dataset.

• Lane keeping event annotations for all drivers. Lengthy stretches of lane keeping (as seen from the looking-out camera) are broken into non-overlapping 5-second time window segments to create lane keeping events. Table II contains the number of such events annotated and considered for the following analysis. Let's call this the Lane-keeping-events-dataset.

• Gaze zone annotation of every frame in the Lane-change-events-dataset. When annotating each 20-second continuous video segment, human annotators had to choose between one of the nine zones or Unknown. When ambiguous samples arose, the annotators used temporal information and outside context to make the call. Use of Unknown was highly discouraged except for transitions between zones. Let's call this the Gaze-dynamics-dataset.

Figure 4. Evaluation of our gaze estimator on the Gaze-zone-dataset, which comprises balanced samples with respect to gaze zones for each of the seven drivers. The confusion matrix is generated from a leave-one-driver-out cross-validation, where the rows are true classes and the columns are estimates. The rows as displayed may not sum to one because of the Unknown class.

B. Evaluation of Gaze Dynamics

Of the works listed in Table I that estimate gaze automatically, many output one of a number of gaze zones of interest. In those works, the performance evaluation of the gaze estimator is presented as a confusion matrix showing what percentage is correctly classified and what percentage is misclassified with respect to the gaze zones. The advantage of such an evaluation is that when gaze zones are classified incorrectly, it shows what they are misclassified into, and more often than not the misclassification occurs in spatially neighboring zones (e.g., Front and Speedometer). As for the dataset over which the evaluation occurs, no guarantee is given that consecutive frames are annotated; in fact, in [11], annotations were done every 5 frames. As a point of comparison, Figure 4 presents the results of our gaze estimator (as described in Section III-A) on the Gaze-zone-dataset as a confusion matrix, with a weighted accuracy of 83.5% (i.e., accuracy is calculated per gaze zone and averaged over all gaze zones) from leave-one-driver-out cross-validation. A short sketch of this weighted-accuracy computation is given below.
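As an aside, the per-class-averaged (weighted) accuracy reported above can be computed from a confusion matrix as in the following sketch; the matrix entries are arbitrary placeholders.

import numpy as np

def weighted_accuracy(confusion):
    """confusion[i, j] = number of samples of true class i predicted as class j.
    Per-class accuracy is the diagonal divided by the row sum; the weighted
    accuracy is the mean of the per-class accuracies."""
    confusion = np.asarray(confusion, dtype=float)
    per_class = np.diag(confusion) / confusion.sum(axis=1)
    return per_class.mean()

# Toy 3-zone confusion matrix (placeholder counts).
C = [[90, 5, 5],
     [10, 80, 10],
     [4, 6, 90]]
print(weighted_accuracy(C))   # mean of 0.90, 0.80, 0.90 -> about 0.867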

There are two important and often overlooked facts about this form of evaluation when considering the application of the gaze estimator to continuous video sequences where the interest is in semantics like gaze accumulation, glance durations, and glance frequencies. First is the lack of evaluation on images or video frames where the driver's gaze is in transition between two gaze zones. Since the gaze estimator is not explicitly trained to classify transition states, it is expected to ideally classify those transition instances into one of the two gaze zones involved in the transition. Second is the lack of metrics to gauge the effect of misclassification errors on a continuous segment. For instance, consider the highest misclassification rate, between front right windshield and rearview mirror, seen in the confusion matrix in Figure 4. When does the misclassification occur? In the periphery of a continuous glance, in the middle of a glance, or in transition between glances? Depending on the type of misclassification, it will affect the glance duration and frequency calculations differently.

The first limitation is addressed by creating the Gaze-dynamics-dataset. To address the latter limitation, this paper introduces two performance evaluation metrics for gaze dynamics with respect to gaze accumulation. One is the ratio of estimated gaze accumulation to true gaze accumulation per gaze zone:

\text{Relative ratio of AG}(z_j) = \begin{cases} \dfrac{\widehat{AG}(z_j)}{AG(z_j)} & \text{if } AG(z_j) \neq 0 \\ 0 & \text{if } AG(z_j) = 0 \end{cases} \quad (1)

where AG(z_j) is the gaze accumulation calculated from ground-truth annotations of gaze zones over a time period for gaze zone z_j, and \widehat{AG}(z_j) is the gaze accumulation calculated from estimated gaze zones over the same time period. Note that, in a given time window, only true-positive gaze accumulation is considered by the first metric. Therefore, the second metric is designed to account for false gaze accumulations:

\text{Abs error of AG}(z_j) = \begin{cases} 0 & \text{if } AG(z_j) \neq 0 \\ \widehat{AG}(z_j) & \text{if } AG(z_j) = 0 \end{cases} \quad (2)

A brief Python sketch of both metrics follows.
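A minimal Python sketch of the two metrics, assuming per-zone gaze accumulation vectors have already been computed (e.g., with the descriptor sketch in Section III-B), is:

import numpy as np

def gaze_accumulation_metrics(ag_true, ag_est):
    """ag_true, ag_est: per-zone gaze accumulation vectors from ground-truth and
    estimated gaze zones over the same window. Returns Eq. (1) and Eq. (2): the
    ratio of estimated to true accumulation where the zone truly appears, and
    the falsely accumulated amount where it does not."""
    ag_true = np.asarray(ag_true, dtype=float)
    ag_est = np.asarray(ag_est, dtype=float)
    present = ag_true != 0
    ratio = np.where(present, ag_est / np.where(present, ag_true, 1.0), 0.0)
    abs_err = np.where(present, 0.0, ag_est)
    return ratio, abs_err

# Toy example over three zones (fractions of a 5-second window).
print(gaze_accumulation_metrics([0.8, 0.2, 0.0], [0.7, 0.1, 0.05]))
# -> ratio = [0.875, 0.5, 0.0], abs error = [0.0, 0.0, 0.05]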

These new performance metrics are applied to the Gaze-dynamics-dataset, where each of the 20-second videos is broken into 5-second segments with up to 4 seconds of overlap, resulting in a total of 1312 samples. The performance is illustrated using a violin plot in Figure 5; a violin plot shows the distribution of the metric outputs over all the samples in the respective gaze zone classes. Ideally, the ratio metric (Eq. 1) is concentrated around 1; however, as seen in Figure 5, only the Front gaze zone follows such a pattern. This result shows promise for accurately detecting attention versus inattention to the forward driving direction, because the ratio of estimated to true gaze accumulation is highly concentrated around 1 for the Front gaze zone. Meanwhile, for the other gaze zones, in the majority of samples the estimated gaze accumulation is less than the true gaze accumulation; this is mainly because glances towards these regions are significantly shorter in duration than glances towards Front and are therefore more prone to noisy estimates.

The second metric (Eq. 2) tries to answer the following questions: what happens when, for a given time period, the ground-truth annotations do not contain any annotation of a particular gaze zone but the gaze estimator produces false positives? Are the false positives sparse or significant in time? According to Figure 5b, the false gaze accumulations are small relative to the 5-second window over which gaze accumulation is calculated. Ideally, when calculating gaze accumulation over a time segment of estimated gaze zones, the ratio metric (Eq. 1) should be around one and the absolute error metric (Eq. 2) should be around zero, meaning that when true positives occur the durations of the estimated glances are close to the durations of the true glances, and when false positives occur the durations of those falsely estimated glances are negligibly small.


Figure 5. Performance evaluation of the gaze zone estimator, presented as a violin plot, i.e., the relative distribution obtained by applying the following two metrics to all the samples in the Gaze-dynamics-dataset: (a) ratio of estimated to true gaze accumulation (Eq. 1) and (b) absolute error in estimated gaze accumulation due to false positives (Eq. 2). The width of the violin at each value of the y-axis indicates the relative likelihood of that value for the gaze zone on the x-axis. Ideally, the width is largest around one for the ratio metric and around zero for the absolute error metric, meaning that when true positives occur the durations of the estimated glances are close to the durations of the true glances, and when false positives occur the durations of those estimated glances are negligibly small. Panels: (a) gaze accumulation ratio of true positives; (b) gaze accumulation error of false positives.

Figure 6. The recall accuracy of lane change prediction (averaged and cross-validated across all drivers in the naturalistic driving scenarios), shown continuously from -5 seconds to 0 seconds prior to the lane change, for three different combinations of spatio-temporal feature descriptors: (a) glance duration plus glance frequency, (b) glance duration only, and (c) gaze accumulation only. LLC stands for left lane change and RLC for right lane change.

C. Evaluation of Gaze Modeling

All evaluations conducted in this study are done with a seven-fold cross-validation; seven because there are seven different drivers, as outlined in Table II. With this setup separating the training and testing samples, we explore the recall accuracy of the gaze behavior models in predicting lane changes as a function of time (Figure 6).

Training occurs on the 5-second time window before SyncF, as represented by the events in Table II. Note that manually annotated gaze zones are used to compute the spatio-temporal features used to train the lane change models, whereas estimated gaze zones are used to train the lane keeping model. In testing, however, only estimated gaze zones are used to compute the spatio-temporal features.

At testing time, we want to test how early the gaze behavior models are able to predict a lane change. Therefore, starting from 5 seconds before SyncF, sequential samples with 1/30 of a second overlap are extracted up to 5 seconds after SyncF; note that the time window at 5 seconds before SyncF encompasses data from 10 seconds before SyncF up to 5 seconds before SyncF. Each sample is tested for fitness against the three gaze behavior models, namely the models for left lane change, right lane change, and lane keeping. The sample is assigned the label of the model which produces the highest fitness score, and if the label matches the true label, the sample is counted as a true positive. Note that each test sample is associated with a time index indicating where it is sampled from with respect to SyncF. By gathering samples at the same time index with respect to SyncF, the recall value at a given time index is calculated by dividing the number of true positives by the total number of positive samples. A sketch of this sliding-window evaluation is given below.
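The per-time-index recall computation can be sketched as follows, reusing the scanpath_descriptors and classify_scanpath helpers from the earlier sketches; the window length, frame rate, and feature combination are placeholder assumptions.

import numpy as np
# scanpath_descriptors and classify_scanpath are the helpers sketched earlier.

def recall_vs_time(events, models, fps=30.0, win_s=5.0, target="Left Lane Change"):
    """events: list of (gaze_sequence, label) pairs, where each gaze sequence is
    a fixed-length window of per-frame gaze zones ending after SyncF. For every
    frame offset, a 5-second sub-window ending at that offset is scored against
    all behavior models, and recall for the target label is computed over the
    positive events at each offset."""
    win = int(win_s * fps)
    n_steps = len(events[0][0]) - win + 1        # number of sliding positions
    hits = np.zeros(n_steps)
    positives = sum(1 for _, label in events if label == target)
    for G, label in events:
        if label != target:                      # only positives count for recall
            continue
        for t in range(n_steps):
            acc, freq, dur = scanpath_descriptors(G[t:t + win], fps)
            h = np.concatenate([dur, freq])      # e.g., duration + frequency features
            if classify_scanpath(h, models) == target:
                hits[t] += 1
    return hits / max(positives, 1)              # recall at each time index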

When calculating recall values, the true labels of the samples are remapped from three classes to two; for instance, when computing recall values for left lane change prediction, all right lane change and lane keeping events are considered negative samples and only the left lane change events are considered positive samples. A similar procedure is used when computing recall values for right lane change prediction. Figure 6 shows the development of the recall values for both left and right lane change prediction continuously from 5 seconds prior to SyncF up to 0 milliseconds prior to SyncF, for three different combinations of features (i.e., glance duration plus frequency, glance duration only, and gaze accumulation only) and for two events (i.e., left lane change and right lane change).

Figure 7. Variations in the fitness of the three models (i.e., left lane change, right lane change, lane keeping) during left and right lane change maneuvers for two different drivers, where the mean and standard deviation are depicted with a solid line and semitransparent shading, respectively. Panels: (a) left lane change events (Driver ID 2); (b) right lane change events (Driver ID 2); (c) left lane change events (Driver ID 5); (d) right lane change events (Driver ID 5).

As expected, the recall curves rise in accuracy the closer in time they get to the lane change event. Also expected is the performance difference with respect to the spatio-temporal features: whereas modeling with gaze accumulation alone achieves above 75% accuracy at 1000 milliseconds prior to the lane change, modeling with glance duration alone and with glance duration plus frequency achieves about 60% and 40% accuracy, respectively. One possible reason for the stark difference in performance when using gaze accumulation versus glance duration and frequency is that the latter may vary across drivers more than the former. For example, one driver may exhibit short glances with high frequency whereas another driver may make long glances with low frequency. Gaze accumulation, however, neatly maps these differences in gaze behavior into one "attention" allocation domain and therefore gives the best performance under the given modeling methods and dataset.

Lastly, in Figure 7, we illustrate the fitness, or confidence, of the learned models around left and right lane change maneuvers for two different drivers. The figure shows the mean (solid line) and standard deviation (semitransparent shading) of the three models (i.e., left lane change, right lane change, lane keeping) using the events from the naturalistic driving dataset described in Table II. The model confidence statistics are plotted from 5 seconds before to 5 seconds after the lane change maneuver, where a time of 0 seconds represents the moment the vehicle is about to change lanes. Interestingly, even though early versus late peaks of the appropriate model can differ across drivers and maneuvers, the satisfactory separation of the lane change models from the lane keeping model, and the spread in dominance of the correct model, show promise for modeling driver behavior using gaze dynamics to anticipate activities and maneuvers.

V. CONCLUDING REMARKS

In this study, we explored modeling the driver's gaze behavior in order to predict maneuvers performed by drivers, namely left lane change, right lane change, and lane keeping. The model developed in this study features three major aspects: first, the spatio-temporal features used to represent gaze dynamics; second, the definition of the model as the average of the observed instances; and third, the design of the metric for estimating the fitness of the model. Applying this framework to a sequential series of time windows around lane change maneuvers, the gaze models were able to predict left and right lane change maneuvers with an accuracy above 75% around 1000 milliseconds before the maneuver.

The overall framework, however, is designed to model the driver's gaze behavior for any task or maneuver performed by the driver. In particular, the spatio-temporal feature descriptor composed of gaze accumulation, glance duration, and glance frequency is a powerful tool for capturing the essence of recurring driver gaze dynamics. To this end, there are multiple future directions in sight. One is to quantitatively define the relationship between the time window from which to extract those meaningful spatio-temporal features and the task or maneuver performed by the driver. Other future directions include exploring and comparing different temporal modeling approaches and generative versus discriminative models.


ACKNOWLEDGMENT

The authors would like to thank the reviewers and the editors for their constructive and encouraging feedback, and their colleagues at the Laboratory for Intelligent and Safe Automobiles (LISA). The authors gratefully acknowledge the support of the UC Discovery Program and industry partners, especially Fujitsu Ten and Fujitsu Laboratories of America.

REFERENCES

[1] M. M. Trivedi, T. Gandhi, and J. McCall, "Looking-in and looking-out of a vehicle: Computer-vision-based enhanced vehicle safety," IEEE Transactions on Intelligent Transportation Systems, 2007.
[2] A. Doshi and M. M. Trivedi, "Tactical driver behavior prediction and intent inference: A review," in IEEE Conference on Intelligent Transportation Systems, 2011.
[3] SAE International, "Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems," 2014.
[4] S. M. Casner, E. L. Hutchins, and D. Norman, "The challenges of partially automated driving," Communications of the ACM, 2016.
[5] A. Jain, H. S. Koppula, B. Raghavan, S. Soh, and A. Saxena, "Car that knows before you do: Anticipating maneuvers via learning temporal driving models," in ICCV, 2015.
[6] C. Gold, D. Dambock, L. Lorenz, and K. Bengler, ""Take over!" How long does it take to get the driver back into the loop?" in Proceedings of the Human Factors and Ergonomics Society Annual Meeting, SAGE Publications, 2013.
[7] A. Tawari and B. Kang, "A computational framework for driver's visual attention using a fully convolutional architecture," in IEEE Intelligent Vehicles Symposium (IV), 2017.
[8] A. Palazzi, F. Solera, S. Calderara, S. Alletto, and R. Cucchiara, "Learning where to attend like a human driver," in IEEE Intelligent Vehicles Symposium (IV), 2017.
[9] F. Vicente, Z. Huang, X. Xiong, F. De la Torre, W. Zhang, and D. Levi, "Driver gaze tracking and eyes off the road detection system," IEEE Transactions on Intelligent Transportation Systems, 2015.
[10] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, "It's written all over your face: Full-face appearance-based gaze estimation," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
[11] A. Tawari, K. H. Chen, and M. M. Trivedi, "Where is the driver looking: Analysis of head, eye and iris for robust gaze zone estimation," in IEEE Conference on Intelligent Transportation Systems (ITSC), 2014.
[12] A. Tawari, A. Møgelmose, S. Martin, T. B. Moeslund, and M. M. Trivedi, "Attention estimation by simultaneous analysis of viewer and view," in 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), 2014.
[13] S. Martin and M. M. Trivedi, "Gaze fixations and dynamics for behavior modeling and prediction of on-road driving maneuvers," in IEEE Intelligent Vehicles Symposium, 2017.
[14] C. Ahlstrom, K. Kircher, and A. Kircher, "A gaze-based driver distraction warning system and its effect on visual behavior," IEEE Transactions on Intelligent Transportation Systems, 2013.
[15] L. Fridman, J. Lee, B. Reimer, and T. Victor, "Owl and lizard: Patterns of head pose and eye pose in driver gaze classification," IET Computer Vision, 2016.
[16] S. A. Birrell and M. Fowkes, "Glance behaviours when using an in-vehicle smart driving aid: A real-world, on-road driving study," Transportation Research Part F: Traffic Psychology and Behaviour, 2014.
[17] M. Munoz, B. Reimer, J. Lee, B. Mehler, and L. Fridman, "Distinguishing patterns in drivers' visual attention allocation using hidden Markov models," Transportation Research Part F: Traffic Psychology and Behaviour, 2016.
[18] N. Li and C. Busso, "Detecting drivers' mirror-checking actions and its application to maneuver and secondary task recognition," IEEE Transactions on Intelligent Transportation Systems, 2016.
[19] L. Fridman, H. Toyoda, S. Seaman, B. Seppelt, L. Angell, J. Lee, B. Mehler, and B. Reimer, "What can be predicted from six seconds of driver glances?" 2017.
[20] B. Vasli, S. Martin, and M. M. Trivedi, "On driver gaze estimation: Explorations and fusion of geometric and data driven approaches," in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2016.
[21] S. Martin, A. Tawari, and M. M. Trivedi, "Monitoring head dynamics for driver assistance systems: A multi-perspective approach," in IEEE International Conference on Intelligent Transportation Systems, 2013.
[22] K. Yuen, S. Martin, and M. Trivedi, "On looking at faces in an automobile: Issues, algorithms and evaluation on naturalistic driving dataset," in IEEE International Conference on Pattern Recognition, 2016.
[23] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[24] X. P. Burgos-Artizzu, P. Perona, and P. Dollar, "Robust face landmark estimation under occlusion," in IEEE International Conference on Computer Vision, 2013.
[25] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[26] S. Vora, A. Rangesh, and M. M. Trivedi, "On generalizing driver gaze zone estimation using convolutional neural networks," in IEEE Intelligent Vehicles Symposium, 2017.

Sujitha Martin received the B.S. degree in Electrical Engineering from the California Institute of Technology in 2010 and the M.S. and Ph.D. degrees in electrical engineering from the University of California, San Diego (UCSD) in 2012 and 2016, respectively. She is currently a research scientist at Honda Research Institute USA. Her research interests are in machine vision and learning, with a focus on human-centered, collaborative, intelligent systems and environments. She helped organize two workshops on analyzing faces at the IEEE Intelligent Vehicles Symposium (IVS 2015, 2016) and the first Women in Intelligent Transportation Systems (WiTS) meet-and-greet networking event at the IEEE IVS (2017). She was recognized as one of the top female graduates in the fields of electrical and computer engineering and computer science at Rising Stars 2016, hosted by Carnegie Mellon University.

Sourabh Vora received his B.S. degree in Electronics and Communications Engineering (ECE) from Birla Institute of Technology and Science (BITS) Pilani, Hyderabad Campus. He received his M.S. degree in Electrical and Computer Engineering (ECE) from the University of California, San Diego (UCSD), where he was associated with the Computer Vision and Robotics Research (CVRR) Lab. His research interests lie in the field of computer vision and machine learning. He is currently working as a Computer Vision Engineer at nuTonomy, Santa Monica.

Kevan Yuen received the B.S. and M.S. degrees in electrical and computer engineering from the University of California, San Diego, La Jolla. During his graduate studies, he was with the Computer Vision and Robotics Research Laboratory, University of California, San Diego. He is currently pursuing a Ph.D. in the field of advanced driver assistance systems with deep learning, in the Laboratory for Intelligent and Safe Automobiles at UCSD.

Mohan Manubhai Trivedi is a Distinguished Professor and the founding director of the UCSD LISA: Laboratory for Intelligent and Safe Automobiles, winner of the IEEE ITSS Lead Institution Award (2015). Currently, Trivedi and his team are pursuing research in distributed video arrays, human-centered self-driving vehicles, human-robot interactivity, machine vision, sensor fusion, and active learning. Trivedi's team has played key roles in several major research initiatives. Some of the professional awards he has received include the IEEE ITS Society's highest honor, the "Outstanding Research Award" (2013), the Pioneer Award (Technical Activities) and Meritorious Service Award of the IEEE Computer Society, and Distinguished Alumni Awards from Utah State University and BITS, Pilani. Three of his students were awarded "Best Dissertation Awards" by professional societies, and his group has received 20+ "Best" or "Honorable Mention" awards at international conferences. Trivedi is a Fellow of the IEEE, IAPR, and SPIE. He regularly serves as a consultant to industry and government agencies in the U.S., Europe, and Asia.