

ST-GRAT: A Novel Spatio-temporal Graph Attention Network for Accurately Forecasting Dynamically Changing Road Speed

1 Cheonbok Park†, 2 Chunggi Lee, 3 Hyojin Bahng, 3 Yunwon Tae, 4 Seungmin Jin, 2 Kihwan Kim, 2 Sungahn Ko∗, and 5 Jaegul Choo∗

1 NAVER Corp, [email protected]
2 Ulsan National Institute of Science and Technology (UNIST), {cglee, kh1875, sako}@unist.ac.kr
3 Korea University, {hjj522, tyj204}@korea.ac.kr
4 National Research University Higher School of Economics (NRU-HSE), [email protected]
5 Korea Advanced Institute of Science and Technology (KAIST), [email protected]

ABSTRACT
Predicting road traffic speed is a challenging task due to different types of roads, abrupt speed changes, and spatial dependencies between roads; it requires the modeling of dynamically changing spatial dependencies among roads and temporal patterns over long input sequences. This paper proposes a novel spatio-temporal graph attention network (ST-GRAT) that effectively captures the spatio-temporal dynamics in road networks. The novel aspects of our approach mainly include spatial attention, temporal attention, and spatial sentinel vectors. The spatial attention takes the graph structure information (e.g., distance between roads) and dynamically adjusts spatial correlation based on road states. The temporal attention is responsible for capturing traffic speed changes, and the sentinel vectors allow the model to retrieve new features from spatially correlated nodes or preserve existing features. The experimental results show that ST-GRAT outperforms existing models, especially in difficult conditions where traffic speeds rapidly change (e.g., rush hours). We additionally provide a qualitative study to analyze when and where ST-GRAT tended to make accurate predictions during rush-hour times.

CCS CONCEPTS
• Information systems → Spatial-temporal systems; • Mathematics of computing → Time series analysis.

KEYWORDS
traffic prediction, graph neural networks, spatial-temporal modeling, attention networks, time-series prediction

ACM Reference Format:
Cheonbok Park, Chunggi Lee, Hyojin Bahng, Yunwon Tae, Seungmin Jin, Kihwan Kim, Sungahn Ko, and Jaegul Choo. 2020. ST-GRAT: A Novel Spatio-temporal Graph Attention Network for Accurately Forecasting Dynamically Changing Road Speed. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20), October 19–23, 2020, Virtual Event, Ireland. ACM, NY, NY, USA, 10 pages. https://doi.org/10.1145/3340531.3411940

∗ Corresponding Authors. † Work done at Korea University.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CIKM '20, October 19–23, 2020, Virtual Event, Ireland
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-6859-9/20/10. . . $15.00
https://doi.org/10.1145/3340531.3411940

1 INTRODUCTION
Predicting traffic speed is a challenging task, as a prediction method needs not only to find innate spatial-temporal dependencies among roads, but also needs to understand how these dependencies change over time and influence other traffic conditions. For example, when a road is congested, there is a high chance that its neighboring roads are also congested. Moreover, roads in residential areas tend to have different traffic patterns compared to those surrounding industrial complexes [10].

Numerous deep learning models [15, 29] have been proposed for traffic speed prediction based on graph convolution neural networks (GCNNs) with recurrent neural networks (RNNs), outperforming conventional approaches [24]. For example, the diffusion convolutional recurrent neural network (DCRNN) [11] combines diffusion convolution [1] with an RNN and demonstrates improved prediction accuracy. Graph WaveNet [25] adapts diffusion convolution, incorporates a self-adaptive adjacency matrix, and uses dilated convolution to achieve state-of-the-art performance. However, these models assume fixed spatial dependencies among roads: they compute spatial dependencies once and reuse them at all times without considering dynamically changing traffic conditions.

To address this issue, recent models [28, 30] utilize multi-head attention [22] to model spatial dependencies. Despite these efforts, they are only partial solutions, as they do not consider the overall graph structure information (e.g., distances, connections, and directions between nodes), which can play an important role in deciding which roads to attend to. In summary, no existing approach considers both dynamically adjusted attention weights and the graph structure information.

Many models employ recurrent neural networks (RNNs) for temporal modeling (e.g., DCRNN [11], GaAN [28]). However, RNNs have a limitation in that they cannot directly access past features in long input sequences, which implies ineffectiveness in modeling


temporal dependencies [25]. Attention-based models can be an alternative that resolves this issue of RNN-based temporal modeling by directly accessing past information in long input sequences. However, existing attention models do not consider dynamic temporal dependencies among roads. For example, there are cases where a road's speed can be best predicted by attending to the road itself, but existing attention models do not consider these cases, so they always retrieve new information from neighboring nodes, even when it is unnecessary.

In this work, we propose a novel spatio-temporal graph attention network (ST-GRAT) for predicting traffic speed that addresses the aforementioned weaknesses. First, we design a spatial attention module that models spatial dependencies by capturing both road speed changes and graph structure information, based on our proposed diffusion prior, directed heads, and distance embedding. Second, we encode temporal dependencies by using attention to directly access distant relevant features of input sequences without any restriction and to effectively capture suddenly fluctuating temporal dynamics. Third, to avoid attending to unrelated roads that are not helpful for prediction, we newly design 'spatial sentinel' key and value vectors, motivated by the sentinel mixture model [13, 14]. Guided by the sentinel vectors, ST-GRAT dynamically decides whether to use new information from other roads or to focus on existing encoded features. The experimental results indicate that ST-GRAT achieves state-of-the-art performance, especially in short-term prediction. We also confirm that ST-GRAT is better than existing models at predicting traffic speeds in situations where road speeds change abruptly (e.g., rush hours). Moreover, compared to existing methods, our model is interpretable thanks to the self-attention mechanism with the sentinel vectors. In the qualitative study, we interpret our trained model by visualizing when and where the model directed its attention under different traffic conditions. Lastly, we present in-depth analyses of how the newly designed components dynamically capture spatio-temporal dependencies.

The contributions of this work include:

• ST-GRAT, which consists entirely of self-attention mechanisms that dynamically capture both the spatial and temporal dependencies of input sequences over time,

• A newly proposed self-attention module with sentinel vectors that help the model decide to focus on existing encoded features instead of unnecessarily attending to other roads,

• A spatial module that uses a diffusion prior and directed heads to effectively encode graph structure,

• Quantitative experiments and comparisons with state-of-the-art models on two real-world datasets, including different time ranges and abruptly changing time ranges (e.g., rush hours), and

• In-depth analysis and interpretation of how ST-GRAT works in varying traffic conditions.

2 RELATED WORK
In this section, we review previous approaches regarding traffic prediction and attention models.

2.1 Approaches for Traffic Forecasting
Deep learning models for traffic prediction usually leverage the spatial and temporal dependencies of road traffic. The graph convolution neural network (GCNN) [9] has been popular for spatial relationship modeling. Given a road network, it aggregates adjacent node information into features based on convolution coefficients. These coefficients are computed from spatial information (e.g., the distance between nodes). RNNs and their variants are combined with the encoded spatial relationships to model temporal dependencies (e.g., speed sequences) [27].

As modeling spatial correlation is a key factor for improving prediction performance, researchers have proposed new approaches for effective spatial correlation modeling. For example, DCRNN [11] combines diffusion convolution [1] and recurrent neural networks to model spatial and temporal dependencies. Graph WaveNet [25] also adapts diffusion convolution in spatial modeling, but it differs from DCRNN in that it 1) considers both connected and unconnected nodes in the modeling process, and 2) uses dilated convolution [21] to learn long sequences of data.

Nonetheless, existing approaches use constant coefficients, which are computed once and applied to all traffic conditions. However, the fixed coefficients may result in inaccuracies when spatial correlation is variable (e.g., abrupt speed changes). Compared to existing models, ST-GRAT improves accuracy by dynamically adjusting the coefficients of neighboring nodes based on their present states and more spatial information (e.g., distance, node connectivity, flow direction).

2.2 Attention Models
Attention-based neural networks are widely used for sequence-to-sequence modeling, such as machine translation and natural language processing (NLP) tasks [4, 12, 17, 26]. Vaswani et al. propose a novel self-attention network called Transformer [22], which is able to dynamically capture diverse syntactic and semantic features of a given context by using multiple self-attention heads. The self-attention mechanism has additional advantages over conventional long short-term memory (LSTM) [8] in that its computation can be easily parallelized and it directly attends to related input items regardless of the coverage of the receptive field. Due to these advantages, Transformer has improved accuracy on many other NLP tasks [5, 16]. Another study [23] used the self-attention network for graph data, demonstrating that attention networks outperform the GCNN model. Zhang et al. [28] propose a graph attention network, replacing the diffusion convolution operation in DCRNN [11] with gating attention. These models show that graph attention models do not lag behind GCNN-based models on spatio-temporal tasks.

While previous models can be used to replace GCNN-based spatial modeling, they all have a drawback: they do not consider the information embedded in the graph structure (such as distances, connectivity, and flow directions between nodes) when deciding when and where to attend in their spatial dependency modeling processes. Compared to previous models, ST-GRAT has a novel spatial attention mechanism that can consider all of the mentioned graph structure information.


Figure 1: Overall architecture of ST-GRAT. The x and y axes indicate the time and the number of layers, respectively. The left half is the L-stacked encoder, while the right half is the L-stacked decoder. We use a special token, '[Start]', to represent the starting point in the decoding stage.

3 PROPOSED APPROACH
In this section, we define the traffic forecasting problem and describe our spatio-temporal graph attention network.

3.1 Problem Setting for Traffic Speed Forecasting

We aim to predict the future traffic speed at each sensor location. We define the input graph as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, A)$, where $\mathcal{V}$ is the set of all the different sensor nodes ($|\mathcal{V}| = N$), $\mathcal{E}$ is the set of edges, and $A \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix. The matrix $A$ includes two types of proximity information in the road network: connectivity and edge weights. Connectivity indicates whether two nodes are directly connected or not. Edge weights are comprised of the distance and direction of the edges between two connected nodes. This proximity information refers to the overall structure of a given graph, including the connectivity, edge directions, and distances of the entire set of nodes.

We denote $X^{(t)} \in \mathbb{R}^{N \times 2}$ as the input feature matrix at time $t$, where $N$ is the number of nodes and 2 is the number of features (the velocity and the timestamp). Following the conventional traffic forecasting problem definition, our problem is to learn a mapping function $f$ that predicts the speeds of the next $T$ time steps, $Y = [X^{(t+1)}_{:,0}, \cdots, X^{(t+T)}_{:,0}]$, given the previous $T$ input speeds in a sequence, $X = [X^{(t-T+1)}, \cdots, X^{(t)}]$, and the graph $\mathcal{G}$, i.e., $Y = f(X, \mathcal{G})$. To solve this sequence-to-sequence learning problem, we utilize an encoder-decoder architecture, as shown in Fig. 1 and described in the following sections.
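To make the shape conventions concrete, the following minimal NumPy sketch spells out the problem setting above; the variable names, the dummy predictor inside f, and the random data are ours and purely illustrative, not the authors' code.

```python
import numpy as np

# METR-LA has N = 207 sensors; the paper uses T = 12 five-minute steps for input and output.
N, T = 207, 12

X = np.random.rand(T, N, 2)   # input sequence: (speed, timestamp) for every sensor and step
A = np.random.rand(N, N)      # weighted adjacency matrix encoding distance/direction

def f(X, A):
    """Stand-in for the learned mapping Y = f(X, G): returns the next T speeds per sensor."""
    # A trained ST-GRAT model would go here; as a placeholder we repeat the last observed speeds.
    return np.repeat(X[-1:, :, 0], T, axis=0)   # shape (T, N)

Y_hat = f(X, A)
assert Y_hat.shape == (T, N)
```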

3.2 Encoder Architecture
Given a sequence of observations, $X$, the encoder consists of spatial attention and temporal attention for predicting the future sequence $Y$. As shown in Fig. 1, a single encoder layer consists of three sequential sub-layers: the spatial attention layer, the temporal attention layer, and the point-wise feed-forward neural networks. The spatial attention layer attends to neighbor nodes spatially correlated with the center node at each time step, while the temporal attention layer attends to individual nodes and focuses on different time steps of a given input sequence. The position-wise feed-forward networks create high-level features that integrate information from the two attention layers. These layers consist of two sequential fully connected networks with the GELU [7] activation function.

The encoder has a skip connection to bypass each sub-layer, and we employ layer normalization [2] and dropout after each sub-layer to improve generalization performance. The overall encoder architecture is a stack of an embedding layer and four ($L = 4$) identical encoder layers. The encoder transforms the spatial and temporal dependencies of an input signal into a hidden representation vector, which is later used by the attention layers in the decoder.
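A minimal PyTorch sketch of one encoder layer, written only from the description above (it is not the authors' implementation). We assume `spatial_attn` and `temporal_attn` are modules mapping a (T, N, d_model) tensor to a tensor of the same shape; the 4x feed-forward width is an assumption, while the dropout rate follows Section 4.1.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Sketch of one ST-GRAT encoder layer: spatial attention, temporal attention, and a
    point-wise feed-forward network, each wrapped with a residual connection, dropout,
    and layer normalization."""
    def __init__(self, spatial_attn, temporal_attn, d_model=128, dropout=0.3):
        super().__init__()
        self.spatial_attn = spatial_attn
        self.temporal_attn = temporal_attn
        self.ffn = nn.Sequential(                       # point-wise feed-forward with GELU
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))
        self.drop = nn.Dropout(dropout)

    def forward(self, z):                               # z: (T, N, d_model)
        sublayers = (self.spatial_attn, self.temporal_attn, self.ffn)
        for sublayer, norm in zip(sublayers, self.norms):
            z = norm(z + self.drop(sublayer(z)))        # residual, dropout, layer norm
        return z
```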

3.3 Embedding Layer
Unlike GCNN-based models, attention-based GNNs [23, 28] mainly utilize the connectivity between nodes. However, conventional models do not consider proximity information in their modeling process. To incorporate the proximity information, the embedding layer in ST-GRAT takes a pre-trained node-embedding vector generated by LINE [19]. The node-embedding features are used to compute spatial attention, which will be further discussed in the following section.

The embedding layer also performs positional embedding to acquire the order of input sequences. Unlike previous methods that use a recurrent or convolutional layer for sequence modeling, we follow the positional encoding scheme of the Transformer [22]. We apply residual skip connections to prevent the vanishing effect of embedded features that can occur as the number of encoder or decoder layers increases. We concatenate each node-embedding result with the node features and then project the concatenated features onto $d_{model}$ dimensions. Lastly, we add the positional encoding vector at each time step.
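The sketch below illustrates our reading of this embedding step: each node's input features are concatenated with its pre-trained LINE embedding, projected to d_model, and the standard Transformer positional encoding is added over the T time steps. The class name and dimensions are ours; the 64-dimensional node embedding follows Section 4.1.

```python
import torch
import torch.nn as nn

def sinusoidal_pe(T, d_model):
    """Standard sinusoidal positional encoding from Transformer [22]."""
    pos = torch.arange(T).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(T, d_model)
    pe[:, 0::2], pe[:, 1::2] = torch.sin(angles), torch.cos(angles)
    return pe

class EmbeddingLayer(nn.Module):
    """Concatenate node features with pre-trained LINE embeddings, project to d_model,
    and add the positional encoding (a sketch, not the authors' code)."""
    def __init__(self, n_feat=2, n_emb=64, d_model=128):
        super().__init__()
        self.proj = nn.Linear(n_feat + n_emb, d_model)

    def forward(self, x, node_emb):                     # x: (T, N, n_feat), node_emb: (N, n_emb)
        T, N, _ = x.shape
        e = node_emb.unsqueeze(0).expand(T, N, -1)      # broadcast node embedding over time
        z = self.proj(torch.cat([x, e], dim=-1))        # (T, N, d_model)
        return z + sinusoidal_pe(T, z.size(-1)).unsqueeze(1)
```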

3.4 Spatial Attention
Fig. 2 shows the proposed spatial attention, which consists of multiple inflow attention heads (odd head indices) and outflow attention heads (even head indices). Previous attention-based GNNs [23, 28] define spatial correlation in an undirected manner: they calculate attention with respect to all neighbor nodes without considering their direction in the road network. In contrast, our model differentiates neighbor nodes by the direction of the attention heads. Specifically, we divide the attention heads so that odd indices are responsible for inflow nodes and even indices are responsible for outflow nodes, which allows the model to attend to different information for inflow and outflow traffic. Fig. 2 shows an inflow attention head, which attends only to the inflow nodes and the node itself.

We denote the encoder hidden states as $Z = [z_1, \cdots, z_N]$, where $z_i \in \mathbb{R}^{d_{model}}$ is the hidden state of the $i$-th node. We denote the set containing the $i$-th node and its neighbor nodes as $\mathcal{N}_i$. We define the dimensions of the query, key, and value vectors as $d_q = d_k = d_v = d_{model}/H$, where $H$ is the number of attention heads in multi-head self-attention.

Figure 2: The proposed spatial attention mechanism. In this example, the inflow spatial attention takes query, key, and value vectors, $q$, $k$, and $v$, respectively. $k_s$ and $v_s$ indicate the sentinel key and value vectors. $z^*$ represents the output of multi-head attention, and $H\#$ is an indicator of the heads. Lastly, $\oplus$, $\otimes$, and $\|$ indicate the element-wise sum, the matrix multiplication, and the concatenation operation, respectively.

To extract diverse high-level features from multiple attention heads, we project the current node onto a query space and $\mathcal{N}_i$ onto the key and value spaces. The output of each attention head is defined as a weighted sum of value vectors, where the weight of each value is computed from a learned similarity function of the corresponding key with the query. However, existing self-attention methods have the constraint that the sum of the weights has to be one. Hence, the query node has to attend to the key-value pairs of $\mathcal{N}_i$, even in situations where no spatial correlation exists among them.

To prevent such unnecessary attention, we add spatial sentinel key and value vectors, which are linear transformations of the query node. For instance, if a query node does not require any information from the key-value pairs of $\mathcal{N}_i$, it will attend to the sentinel key and value vectors (i.e., stick to its existing information rather than always attending to the given key and value information).

Thus, the output feature of the $i$-th node in the $h$-th attention head, $o^h_i \in \mathbb{R}^{d_v}$, is the weighted sum of the spatial sentinel value vector and the value vectors of $\mathcal{N}_i$:

$$o^h_i = \Big(1 - \sum_{j \in \mathcal{N}_i} \alpha_{i,j}\Big) \cdot \big(z_i W^{V_s}_h\big) + \sum_{j \in \mathcal{N}_i} \alpha_{i,j} \big(z_j W^{V_N}_h\big),$$

where $W^{V_s}_h \in \mathbb{R}^{d_{model} \times d_v}$ and $W^{V_N}_h \in \mathbb{R}^{d_{model} \times d_v}$ indicate the linear transformation matrices of the sentinel value vector and the value vectors of spatial attention, respectively. The attention coefficient, $\alpha_{i,j}$, is computed as

$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\exp(e_{i,s}) + \sum_{j \in \mathcal{N}_i} \exp(e_{i,j})}, \qquad (1)$$

where $e_{i,j}$ indicates the energy logits and $e_{i,s}$ represents the sentinel energy logit.

The energy logits are computed by a scaled dot-product of the query vector of the $i$-th node and the key vector of the $j$-th node, i.e.,

$$e_{i,j} = \frac{\big(z_i W^{Q_N}_h\big)\big(z_j W^{K_N}_h\big)^T}{\sqrt{d_k}} + P_h(A), \qquad (2)$$

where the parameter matrices $W^{Q_N}_h \in \mathbb{R}^{d_{model} \times d_q}$ and $W^{K_N}_h \in \mathbb{R}^{d_{model} \times d_k}$ are the linear transformation matrices of the query vector and the key vector. Moreover, to explicitly provide edge information, we include additional prior knowledge $P_h(A)$, called the diffusion prior, based on a diffusion process on the graph. The diffusion prior indicates whether the attention head is an inflow or an outflow attention head, and is defined as

$$P_{2m+1}(A) = \sum_{k=0}^{K} \beta^k_h \cdot \big(D_I^{-1} A^T\big)^k, \qquad P_{2m}(A) = \sum_{k=0}^{K} \beta^k_h \cdot \big(D_O^{-1} A\big)^k, \qquad (3)$$

where $K$ is the number of truncation steps of the diffusion process, and $D_I$ and $D_O$ are the in-coming and out-going diagonal degree matrices, respectively. $D_O^{-1} A$ and $D_I^{-1} A^T$ denote the out-going and in-coming state transition matrices. $\beta^k_h$ is the weight of the diffusion process at step $k$ in the $h$-th attention head, which is a learnable parameter at each layer of the attention head.

Calculating the sentinel energy logit is similar to the other energy logits, but it excludes the prior knowledge and uses a sentinel key vector instead. Hence, it is defined as follows:

$$e_{i,s} = \frac{\big(z_i W^{Q_N}_h\big)\big(z_i W^{K_s}_h\big)^T}{\sqrt{d_k}}, \qquad (4)$$

where $W^{K_s}_h \in \mathbb{R}^{d_{model} \times d_k}$ is the linear transformation matrix of the sentinel key vector. For example, if $e_{i,s}$ is higher than $\sum_j e_{i,j}$, the model will assign less attention to the nodes in $\mathcal{N}_i$.

After computing the output features $o^h_i$ of each attention head, they are concatenated and projected as

$$z^*_i = \mathrm{Concat}(o^1_i, ..., o^H_i)\, W^{O_N}, \qquad (5)$$

where $W^{O_N} \in \mathbb{R}^{d_{model} \times d_{model}}$ is the projection layer. The projection layer helps the model combine various aspects of spatial-correlation features and the outputs of the inflow and outflow attention heads.
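To make Eqs. (1)-(4) concrete, here is a toy NumPy sketch of a single inflow attention head for one node. All weight matrices are random stand-ins, the diffusion-prior values are passed in as plain numbers, and none of this reflects the authors' implementation.

```python
import numpy as np

def inflow_head(z, Wq, Wk, Wv, Wks, Wvs, prior, neighbors, i):
    """One inflow spatial-attention head for node i, following Eqs. (1)-(4) above.
    prior[j] stands in for the diffusion-prior term P_h(A) of the pair (i, j)."""
    d_k = Wk.shape[1]
    q = z[i] @ Wq                                         # query vector of node i
    e = {j: (q @ (z[j] @ Wk)) / np.sqrt(d_k) + prior[j]   # energy logits e_{i,j}, Eq. (2)
         for j in neighbors}
    e_s = (q @ (z[i] @ Wks)) / np.sqrt(d_k)               # sentinel logit e_{i,s}, Eq. (4)
    denom = np.exp(e_s) + sum(np.exp(v) for v in e.values())
    alpha = {j: np.exp(v) / denom for j, v in e.items()}  # attention coefficients, Eq. (1)
    out = (1.0 - sum(alpha.values())) * (z[i] @ Wvs)      # sentinel value share
    for j in neighbors:
        out = out + alpha[j] * (z[j] @ Wv)                # neighbor value shares
    return out                                            # output feature o_i^h

# Tiny usage example with random weights (purely illustrative).
rng = np.random.default_rng(0)
N, d_model, d_h = 5, 8, 2
z = rng.normal(size=(N, d_model))
Wq, Wk, Wv, Wks, Wvs = (rng.normal(size=(d_model, d_h)) for _ in range(5))
o = inflow_head(z, Wq, Wk, Wv, Wks, Wvs, prior={1: 0.1, 2: -0.2}, neighbors=[1, 2], i=0)
print(o.shape)  # (2,)
```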

3.5 Temporal Attention
There are two major differences between temporal and spatial attention: 1) temporal attention does not use the sentinel vectors and the diffusion prior, and 2) temporal attention attends to important time steps of each node, while spatial attention attends to important nodes at each time step. However, temporal attention is similar to spatial attention in that it uses multi-head attention to capture diverse representations in the query, key, and value spaces. We utilize the multi-head attention mechanism proposed in Transformer [22].

Note that the temporal attention layer can directly attend to features across time steps without any restriction in accessing information in the input sequence, which is different from previous approaches [11, 28] that cannot access features at distant time steps.
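Because the temporal attention follows the standard multi-head attention of Transformer [22], it can be sketched with PyTorch's stock module by treating the N nodes as the batch dimension, so that attention runs over the T time steps of each node; the shapes below are illustrative only.

```python
import torch
import torch.nn as nn

T, N, d_model, H = 12, 207, 128, 4
z = torch.randn(T, N, d_model)                 # encoder hidden states

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=H)
# Inputs are (seq_len=T, batch=N, d_model): each node attends over its own 12 time steps.
out, weights = attn(z, z, z)
print(out.shape, weights.shape)                # (12, 207, 128), (207, 12, 12)
```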


Table 1: Prediction accuracy on METR-LA.

T        Metric   GCRNN   DCRNN   GaAN    STGCN   Graph WaveNet   HyperST   GMAN    ST-GRAT
15 min   MAE      2.80    2.73    2.71    2.88    2.69            2.71      2.81    2.60
         RMSE     5.51    5.27    5.24    5.74    5.15            5.23      5.55    5.07
         MAPE     7.5%    7.12%   6.99%   7.62%   6.90%           -         7.43%   6.61%
30 min   MAE      3.24    3.13    3.12    3.47    3.07            3.12      3.12    3.01
         RMSE     6.74    6.40    6.36    7.24    6.26            6.38      6.46    6.21
         MAPE     9.0%    8.65%   8.56%   9.57%   8.37%           -         8.35%   8.15%
1 hour   MAE      3.81    3.58    3.64    4.59    3.53            3.58      3.46    3.49
         RMSE     8.16    7.60    7.65    9.40    7.37            7.56      7.37    7.42
         MAPE     10.9%   10.43%  10.62%  12.70%  10.01%          -         10.06%  10.01%
Average  MAE      3.28    3.14    3.16    3.64    3.09            3.13      3.13    3.03
         RMSE     6.80    6.42    6.41    7.46    6.26            6.39      6.46    6.23
         MAPE     9.13%   8.73%   8.72%   9.96%   8.42%           -         8.61%   8.25%

Table 2: Summary of experiment results on the PEMS-BAY dataset.

T        Metric   DCRNN   STGCN   Graph WaveNet   GMAN    ST-GRAT
15 min   MAE      1.38    1.36    1.30            1.36    1.29
         RMSE     2.95    2.96    2.74            2.93    2.71
         MAPE     2.9%    2.9%    2.73%           2.88%   2.67%
30 min   MAE      1.74    1.81    1.63            1.64    1.61
         RMSE     3.97    4.27    3.70            3.78    3.69
         MAPE     3.9%    4.17%   3.67%           3.71%   3.63%
1 hour   MAE      2.07    2.49    1.95            1.90    1.95
         RMSE     4.74    5.69    4.52            4.40    4.54
         MAPE     4.9%    5.79%   4.63%           4.45%   4.64%
Average  MAE      1.73    1.88    1.63            1.63    1.62
         RMSE     3.88    4.30    3.65            3.70    3.65
         MAPE     3.9%    4.28%   3.67%           3.68%   3.65%

3.6 Decoder Architecture
The overall structure of the decoder is similar to that of the encoder. The decoder consists of the embedding layer and four other sub-layers: the spatial attention layer, two temporal attention layers, and the feed-forward neural networks. After each sub-layer, layer normalization is applied. One difference between the encoder and decoder is that the decoder contains two different temporal attention layers: the masked attention layer and the encoder-decoder (E-D) attention layer. The masked attention layer masks future time step inputs to restrict attention to present and past information. The encoder-decoder attention layer extracts features by using the encoded vectors from both the encoder and the masked attention layer. In this layer, the output of the encoder is used as both the key and the value, and the query features of each node are passed along from the masked self-attention. Finally, the linear layer predicts the future sequence.

4 EVALUATION
We present our experimental results on two real-world large-scale datasets, METR-LA and PEMS-BAY, released by [11].

The METR-LA and PEMS-BAY datasets contain speed data for a period of four months from 207 sensors and six months from 325 sensors, gathered on the highways of Los Angeles County and in the Bay Area, respectively. We pre-process the data to have five-minute-interval speed and timestamp data, replace missing values with zeros, and apply the z-score and min-max normalization. We use 70% of the data for training, 10% for validation, and the rest for testing. Our pre-processing follows the approach used in DCRNN [11] (https://github.com/liyaguang/DCRNN/tree/master/data). We present a comprehensive experiment result with the METR-LA dataset, as the LA area's traffic conditions are more complicated than those in the Bay Area [11].

4.1 Experimental Setup
ST-GRAT predicts the speeds of the next 12 time steps from the present time (five-minute intervals, one hour in total) based on the input speeds of the previous $T = 12$ time steps. We train a four-layer spatio-temporal attention model ($H = 4$, $d_{model} = 128$). To reduce the discrepancy between the training and testing phases, we utilize inverse sigmoid decay scheduled sampling [3].

For optimization, we apply the Adam optimizer with warmup [22] and set the warmup step size and batch size to 4,000 and 20, respectively. We use dropout (rate: 0.3) [18] on the inputs of every sub-layer and on the attention weights. We initialize the parameters by using Xavier initialization [6] and use a uniform distribution, $U(1, 6)$, to initialize the weights of the diffusion prior. LINE [19] is used for node embedding with dimension 64, which takes two minutes to train with 20 threads. We use the distance between two connected nodes and the correlation computed by Vector Auto-Regression (VAR) [31] as edge weights.
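For reference, the two training schedules referred to above can be written as follows: the warmup learning-rate schedule is the one from Transformer [22], and the sampling probability follows the inverse sigmoid decay of [3]. The decay constant k is not reported in the paper, so the value below is a placeholder.

```python
import math

def warmup_lr(step, d_model=128, warmup=4000):
    """Transformer learning-rate schedule [22]: d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def sampling_prob(i, k=2000.0):
    """Inverse sigmoid decay for scheduled sampling [3]: probability of feeding the ground
    truth (rather than the model's own prediction) at training iteration i.
    The constant k is not given in the paper; 2000 is an illustrative placeholder."""
    return k / (k + math.exp(i / k))

print(round(warmup_lr(1), 8), round(warmup_lr(4000), 6))
print(round(sampling_prob(0), 3), round(sampling_prob(10000), 3))
```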

4.2 Experimental Results
We compare the performance of ST-GRAT with baseline models, including state-of-the-art deep learning models: a graph convolution based RNN (GCRNN), DCRNN [11], GaAN [28], STGCN [27], Graph WaveNet [25], GMAN [30], and HyperST [15]. We reproduce DCRNN (https://github.com/liyaguang/DCRNN/), Graph WaveNet (https://github.com/nnzhan/Graph-WaveNet), and GMAN (https://github.com/zhengchuanpan/GMAN) using the source code and hyperparameters published by the authors.


Figure 3: Two examples of the intervals extracted by ruptures, with the y-axis showing the speed (mph) and the x-axis showing the duration (min) of each interval.

When the source code is not available, we use the results reported in the original paper for comparison. For a few models, the official code is unavailable and no results on PEMS-BAY were published (Table 2). In the experiment, we measure the accuracy of the models using mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). The task is to predict the traffic speed 15, 30, and 60 minutes from the present time. We also report the average scores of the three forecasting horizons on each dataset.
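For completeness, the three metrics are computed as below (standard definitions; this sketch ignores the masking of missing/zero values that traffic benchmarks typically apply).

```python
import numpy as np

def mae(y, y_hat):   return np.mean(np.abs(y - y_hat))
def rmse(y, y_hat):  return np.sqrt(np.mean((y - y_hat) ** 2))
def mape(y, y_hat):  return np.mean(np.abs((y - y_hat) / y)) * 100   # in percent

y     = np.array([60.0, 45.0, 30.0])      # ground-truth speeds (illustrative)
y_hat = np.array([58.0, 47.0, 33.0])      # predicted speeds (illustrative)
print(mae(y, y_hat), rmse(y, y_hat), round(mape(y, y_hat), 2))
```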

Table 1 and Table 2 show the experimental results on both datasets, where we observe that ST-GRAT achieves state-of-the-art performance in the average scores. In particular, ST-GRAT excels at predicting speed 15 and 30 minutes ahead. ST-GRAT shows higher accuracy than GaAN, which is based on undirected spatial attention and neglects the proximity information. ST-GRAT also achieves higher performance than DCRNN and Graph WaveNet, demonstrating that our spatial and temporal attention mechanism is more effective than those of the two models for predicting short-term future sequences.

We further evaluate ST-GRAT from several perspectives. First, we compare the forecasting performance of the models in separate time ranges. We do this because traffic congestion patterns in a city change dynamically over time. For example, some roads around residential areas are congested during regular rush hour periods, while those around industrial complexes are congested from late night to early morning [10]. Table 3 shows the experimental results for four time ranges (00:00–05:59, 06:00–11:59, 12:00–17:59, and 18:00–23:59), and we find that ST-GRAT performs better than Graph WaveNet.

Secondly, to evaluate how ST-GRAT adapts to speed changes during rush hours, we extract intervals from the data where road speeds change rapidly. We use "ruptures" [20], a change point detection library for non-stationary signals. The hyper-parameters of ruptures are 10, 1, and 6 for the penalty value, jump, and minimum size, respectively. Fig. 3 shows the extracted intervals of two example roads, where the y-axis represents speed and the x-axis represents the sequence of intervals. To find the intervals related to traffic congestion, we keep only the intervals in which the slowest speed drops below 20 mph. Table 3 shows the performance comparison between ST-GRAT, Graph WaveNet, and DCRNN, and we observe that our model captures the temporal dynamics of speed better than the others.
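A small sketch of this extraction step with the ruptures package [20] and the stated hyper-parameters. The search method (Pelt), the cost model ("l2"), and the synthetic speed signal are assumptions of ours; the paper only specifies the penalty, jump, and minimum size.

```python
import numpy as np
import ruptures as rpt

# Synthetic speed signal: free flow, a congested stretch, then recovery (illustrative only).
rng = np.random.default_rng(0)
speeds = np.concatenate([np.full(60, 62.0), np.full(40, 18.0), np.full(50, 58.0)])
speeds += rng.normal(0, 1.0, size=speeds.size)

# Change-point detection with penalty 10, jump 1, and minimum segment size 6.
algo = rpt.Pelt(model="l2", min_size=6, jump=1).fit(speeds)
breakpoints = algo.predict(pen=10)          # interval end indices, e.g., [60, 100, 150]

# Keep only intervals whose slowest speed drops below 20 mph (congestion intervals).
segments = zip([0] + breakpoints[:-1], breakpoints)
impeded = [(start, end) for start, end in segments if speeds[start:end].min() < 20]
print(breakpoints, impeded)
```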

Third, we visualize the traffic speed predictions of each model on line charts to illustrate the prediction patterns during impeded time intervals in the METR-LA dataset.

Figure 4: Traffic speed prediction visualization in the impeded intervals. ST-GRAT generates more accurate predictions and is more robust, especially when speeds abruptly change.

As shown in Figure 4 (A), there is no time lag between the ground truth and the ST-GRAT prediction, while the predicted traffic speeds of the other baseline models follow the ground truth only after the speed has already dropped. ST-GRAT is also more accurate than the baseline models during abrupt changes and impeded intervals, as shown in Figure 4 (B). This is because ST-GRAT exploits the overall graph structure information and effectively encodes temporal dynamics.

4.3 Ablation Study
We conduct an ablation study to understand the impact of different hyper-parameters. Table 4 shows the experimental results, where $L$, $d_{model}$, and $H$ denote the number of layers, the dimension of the hidden state features, and the number of heads, respectively. "Embedding" and "Range" indicate different embedding methods and attention step sizes. If the range is one, the model only attends to nodes within a single step, i.e., directly connected nodes. If the range is two, the model attends to nodes within two steps (i.e., directly linked nodes and their neighbors). The hyphens denote excluded parameters. "Directed" indicates that the input graph is a directed graph. Finally, "Prior" and "Sentinel" indicate whether the model uses the diffusion prior and the sentinel vectors. We report the average MAE over prediction horizons from five minutes to one hour. Other settings are identical to those used in subsection 4.1.

According to row (A), we observe that the performance of our model improves as the number of layers increases. A deeper model performs better than a shallow model, because having a small number of layers causes the model to underfit. In row (B), we find that the dimension of the hidden vector also impacts the final performance. When the dimension of the hidden state vector increases, the key and query dimensions also increase, allowing more information to be encoded.


Table 3: Experimental results with different time ranges and intervals. For each horizon (15 min, 30 min, 1 hour, Average), the three columns are MAE, RMSE, and MAPE.

Model / T                          15 min                 30 min                 1 hour                  Average
Our Model (00-06)                  2.37  3.57   4.30%     2.41  3.67   4.44%     2.45   3.77   4.53%     2.38  3.63   4.37%
Graph WaveNet (00-06)              2.41  3.57   4.38%     2.45  3.68   4.50%     2.52   3.75   4.61%     2.44  3.65   4.47%
DCRNN (00-06)                      2.4   3.58   4.36%     2.44  3.72   4.52%     2.48   3.8    4.6%      2.43  3.68   4.47%
GMAN (00-06)                       2.39  3.59   4.35%     2.41  3.69   4.43%     2.43   3.75   4.48%     2.40  3.66   4.40%
Our Model (06-12)                  2.55  5.31   6.93%     3.07  6.72   8.82%     3.68   8.16   10.93%    3.01  6.51   8.63%
Graph WaveNet (06-12)              2.67  5.43   7.55%     3.19  6.79   9.42%     3.73   8.01   11.28%    3.11  6.55   9.15%
DCRNN (06-12)                      2.74  5.67   7.59%     3.28  7.16   9.6%      3.9    8.64   11.68%    3.22  6.95   9.35%
GMAN (06-12)                       2.79  5.84   7.78%     3.22  7.00   9.27%     3.69   8.16   10.82%    3.17  6.83   9.09%
Our Model (12-18)                  3.14  6.16   9.34%     3.80  7.69   12.02%    4.55   9.29   15.53%    3.72  7.47   11.88%
Graph WaveNet (12-18)              3.29  6.23   10.25%    3.80  7.49   12.41%    4.48   8.85   15.39%    3.77  7.31   12.28%
DCRNN (12-18)                      3.33  6.43   10.11%    3.95  7.83   12.58%    4.66   9.33   15.66%    3.80  7.64   12.39%
GMAN (12-18)                       3.48  6.87   10.78%    3.96  7.99   12.78%    4.48   9.12   15.26%    3.90  7.81   12.65%
Our Model (18-24)                  2.34  4.77   5.69%     2.72  5.86   7.00%     3.21   7.10   8.60%     2.69  5.75   6.91%
Graph WaveNet (18-24)              2.41  4.83   6.09%     2.78  5.87   7.47%     3.19   6.90   8.99%     2.73  5.71   7.33%
DCRNN (18-24)                      2.48  5.09   6.24%     2.85  6.19   7.79%     3.27   7.33   9.53%     2.81  6.04   7.66%
GMAN (18-24)                       2.52  5.23   6.49%     2.82  6.16   7.78%     3.14   7.07   9.10%     2.78  6.03   7.65%
Our Model (Impeded Interval)       5.82  9.97   29.0%     7.94  13.31  41.20%    10.56  16.92  57.96%    7.81  13.32  41.04%
Graph WaveNet (Impeded Interval)   6.45  10.69  33.19%    8.62  14.10  45.89%    11.14  17.65  62.79%    8.44  14.05  45.49%
DCRNN (Impeded Interval)           6.44  10.41  33.75%    8.39  13.52  45.22%    10.71  16.76  60.93%    8.22  13.45  44.91%
GMAN (Impeded Interval)            6.94  11.55  36.05%    8.76  14.41  46.93%    10.74  17.22  60.38%    8.60  14.29  46.53%

Table 4: Ablation study of ST-GRAT with the METR-LA validation set.

            L   d_model   Embedding   H   Range   Directed   Prior   Sentinel   MAE
Our model   4   128       Distance    4   2       ✔          ✔       ✔          2.80
(A)         2                                                                   3.02
            3                                                                   2.98
(B)             64                                                              2.89
(C)                       –                                                     2.97
                          Random                                                2.90
(D)                                   2                                         3.01
                                      8                                         2.88
(E)                                       1                                     2.98
(F)                                               –                             2.87
(G)                                                          –                  2.85
(H)                                                                  –          2.89

This enables the model to better consider the elaborate relation between the query and key, leading to higher performance than a model with a small hidden dimension. Moreover, as we expected, we observe a positive correlation between the number of heads and the performance of the model in row (D). This is because, as the number of heads increases, the model can capture more diverse aspects of spatial dependency than a model with a small number of heads. For example, if the model only has two heads, the first head might only consider the nodes that are experiencing traffic congestion, while the second head only focuses on nodes that have light traffic.

Row (C) shows the results of different node-embedding settings in comparison to our model, which utilizes the distance between nodes for the node embedding. We either remove the embedding layer or replace it with random initialization. Since neither variant has access to proximity information, both perform attention improperly, failing to attend more to close nodes than to distant nodes. Comparing the different embedding settings, we observe that proximity information strongly influences the performance of our model.

From row (E), we show that the performance is also affected by the range of neighbor nodes involved in the spatial attention process. The performance with $Range = 2$ is better than that with $Range = 1$ because the model with $Range = 2$ considers more neighbor nodes than the model with $Range = 1$. However, as the range increases, the computation cost (i.e., the number of scaled dot-product operations) also increases.

Row (F) shows the effect of directed attention (inflow and outflow attention). With undirected attention, the spatial attention needs to compare the similarity of the query and key over all in-coming and out-going nodes of a particular node. As a result, even though undirected attention takes more nodes into consideration, its performance is lower than that of our model, which uses directed attention. This experiment shows that our directed attention, which splits the heads into inflow and outflow heads, is well suited to the traffic prediction problem. Row (G) shows the effect of the diffusion prior. The base model performs better than the model without the diffusion prior, because the diffusion prior allows the model to adjust the attention logits using the graph structure.

Row (H) shows the effect of the sentinel key and value vectors, described in the spatial attention section. We find that the model with the sentinel vectors outperforms the model without them. This result indicates that the sentinel vectors help the model improve the robustness of traffic prediction by avoiding unnecessary attention.

5 QUALITATIVE ANALYSIS
In this section, we describe how ST-GRAT captures spatio-temporal dependencies within a road network.

Figure 5: (A) Two line charts of the current node and a neighbor node, (B) spatial-attention heatmaps at different time steps, and (C) temporal-attention heatmaps at different time steps, on the METR-LA dataset in an impeded interval (2012/06/20 16:05–2012/06/20 20:10).

Fig. 5 (A1) shows the speed changes of Node 115, where the x- and y-axes represent time and speed (mph), respectively. As shown in Fig. 5 (A1), this road was congested from 16:50, and the congestion was alleviated at approximately 19:35 (i.e., a typical congestion pattern in the evenings). The heatmaps of Fig. 5 (B1–B3) provide information on the spatial attention of the last layer at different times (B1–T1, B2–T2, B3–T3). The y-axis of each head (ordered in-out-in-out) represents the 12 time steps, and the x-axis indicates the 207 nodes and the sentinel vector (last column).

First, we find that ST-GRAT gave more attention to six nodes (54, 122, 141, 168, 175, and 201) than to the others in light traffic (T1 in Fig. 5 (A1)). Then, as time passed from T1 to T2, Nodes 54 and 175 gradually received less attention, and Nodes 52, 111, 145, and 201 gained more attention, as shown in Fig. 5 (B2). A notable observation is that ST-GRAT attended to Node 145 and Node 201 (i.e., dark blue bars) more than the other nodes. We review the speed records of Node 145 (Fig. 5, A2) due to its strong attention and find that the speed of Node 145 tended to precede that of Node 115 by about 30 minutes (Fig. 5 (A2)). Checking the correlations between pairs of nodes based on the VAR, we find that the speed pattern of Node 115 was highly correlated with that of Nodes 145 and 201 (0.59 and 0.56, respectively), while it was less correlated with the other nodes (e.g., 0.48 with Node 52). An interesting point here is that the distances from the two nodes to Node 115 are different (0.9 miles from Node 145 and 7.64 miles from Node 201). This result shows that ST-GRAT learned dynamically changing spatial dependencies along with the graph structure information, such as correlations and distances.

When the traffic congestion was alleviated at 19:35 (Fig. 5, T3), ST-GRAT attended to its sentinel vector in Head#1 (In) and Head#4 (Out), as shown in Fig. 5 (B3). This implies that ST-GRAT decided to utilize the existing features extracted from 18:45 to 19:05 by attending to the sentinel vector. ST-GRAT also attended to Node 125 (correlation: 0.43) to update the spatial dependency from a neighboring road (2.7 miles away from Node 115). We see a similar behavior in Head#4 (Out): ST-GRAT focused on the sentinel vector while attending to the distant Node 168 (correlation: 0.59, 7.8 miles away from Node 115). In the spatial attention, ST-GRAT thus dynamically uses new features from spatially correlated nodes or preserves existing features.

Next, we analyze the second head of the first layer in the encoder, decoder, and encoder-decoder (E-D) temporal attention (Fig. 5 (D1)). Note that Fig. 5 (C) shows how the input and output sequences are mapped to each axis of the attention heatmaps in D1–D3. We see from Fig. 5 (D1) that when Node 115 was not congested (T1), the temporal attention of the encoder and the decoder was spread out evenly across all time steps. However, it is interesting that the temporal attention of the encoder gradually became divided into two regions (top-left and bottom-right) as the road became congested, as shown in the series of heatmaps in Fig. 5 (D2). Here, the top-left region represents the information of the past unimpeded conditions, while the other region contains information on the current impeded conditions. We believe this divided attention is reasonable, as ST-GRAT needed to consider two possible future directions at the same time: one with static congestion and another with a changing condition, possibly with a recovering speed. It is notable that the first column of the temporal attention of the decoder became darker (D2), which implies that ST-GRAT attended to recent information in the impeded condition. Overall, ST-GRAT uses the attention module in an adaptive manner with evolving traffic conditions to effectively capture the temporal dynamics of traffic.

5.1 Attention Patterns in Different Traffic Conditions

In order to accurately model the spatial dependencies among roads, models need to dynamically adjust spatial correlations based on different traffic conditions (e.g., impeded conditions) and the graph structure information, such as the connectivity, directions, and distances between nodes.

Figure 6: An example of spatio-temporal dependency modeling based on node speeds and the distance between query and key nodes. (A) Node 72's speed trend; (B) nodes attended by Node 72 (yellow), where red and blue nodes are attended at 6 am (unimpeded) and 6 pm (impeded), respectively; and (C) the overall distribution of distances between the query nodes and the corresponding key nodes. Node 72 attended to nodes closer to itself as the road became congested (p-value ≪ 0.01).

However, previous models do not fully utilize the graph structure information in their spatial modeling [28, 30]. In contrast, ST-GRAT incorporates graph structure features into the spatial attention by using the diffusion prior, distance embedding, and directed heads. In this section, we describe how ST-GRAT dynamically adjusts spatial correlations based on road conditions.

Fig. 6 describes an example attention pattern for Node 72. As shown in Fig. 6 (A), Node 72 is not congested at 6 am (average speed: 61 miles/h), while it is impeded at 6 pm (average speed: 18 miles/h). Fig. 6 (B) shows that during the impeded traffic condition (6 pm), Node 72's key nodes (blue dots) are much closer to the query Node 72 than the key nodes of the unimpeded traffic condition (6 am, red dots). Note that this is a magnified map, and key nodes with attention weights less than 0.1 are filtered out. Fig. 6 (C) shows the overall distance distribution of the key nodes of 144 query nodes on June 4, 2012, at 6 am and 6 pm. We choose the 144 query nodes whose speeds are less than 30 miles/h at that time and pick the longest distance between each query node and its key nodes. We find that query nodes use the information of key nodes at significantly closer distances in the impeded condition (average distance: 326.4, STD: 206.7) than in the unimpeded condition (average distance: 414.3, STD: 186.3), according to the independent two-sample t-test result (t[143]=3.3, p-value<0.01). This result shows that ST-GRAT predicts road speeds by considering the road speeds, the distances between query and key nodes, and the road network, which leads to better prediction performance than the existing models.
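The significance test mentioned above can be reproduced in spirit with SciPy; the arrays below are synthetic stand-ins drawn from the reported means and standard deviations, not the paper's actual per-node distances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
dist_impeded   = rng.normal(326.4, 206.7, size=144)   # 6 pm (impeded) longest key distances
dist_unimpeded = rng.normal(414.3, 186.3, size=144)   # 6 am (unimpeded) longest key distances

# Independent two-sample t-test, as described in the text.
t, p = stats.ttest_ind(dist_unimpeded, dist_impeded)
print(round(t, 2), p < 0.01)
```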

5.2 Analysis of Sentinel Vectors
To analyze how the sentinel vectors work in the speed prediction task, we investigate the nodes that extensively utilize the vectors. For this, we first choose the nodes that have a sentinel attention weight ($\alpha_{i,s}$) higher than 0.35 (the upper 10% of all nodes), shown as blue nodes in Fig. 7. We find it interesting that the nodes with high sentinel weights tend to be located on the outskirts of the city. We also observe that they do not have many neighbor nodes compared to those in other areas (e.g., at the center of the city). Thus, they do not have a sufficient pool of nodes with high spatial correlations for speed prediction. This means that if a model is forced to encode new information based on spatial encoding, it is highly likely to use the information of only a few other nodes, which may not be helpful for prediction.

Figure 7: Nodes highly affected by the sentinel vectors in the METR-LA data. We visualize the sensor distribution of the METR-LA dataset and mark in blue the nodes whose sentinel attention weight ($\alpha_{i,s}$) is higher than 0.35.

Table 5: The computation times on the METR-LA dataset.

Computation Time     DCRNN   GaAN     Graph WaveNet   GMAN    ST-GRAT
Training (s/epoch)   504.4   1461.4   203.89          552.1   341.7
Inference (s)        34.0    131.10   8.42            50.02   48.67

We also think that this attention strategy could negatively affect the performance of neighbor nodes that use such a node as a key node.

To avoid these situations, ST-GRAT utilizes the sentinel vectors that contain features acquired in previous steps. To measure the effect of the vectors on such nodes, we compare ST-GRAT to a version without the sentinel vectors (ST-GRAT-NoSentinelVector) and find that ST-GRAT shows higher performance on the METR-LA test data (ST-GRAT: 2.27, ST-GRAT-NoSentinelVector: 2.34). As such, we believe that ST-GRAT improves the prediction performance by leveraging the sentinel vectors.


5.3 Computation Time
In this section, we report the computation costs of the models on the METR-LA dataset, as shown in Table 5. In terms of training time, ST-GRAT, whose self-attention requires a constant number of sequential operations ($O(1)$), is faster than RNN-based models such as DCRNN and GaAN, which require $O(N)$ sequential operations [22]. Comparing ST-GRAT to the attention models, we find that ST-GRAT is about four and three times faster than GaAN in the training and inference stages, respectively. ST-GRAT is also faster than GMAN, which needs to consider more roads, as it does not use the connectivity information between nodes. Graph WaveNet performs best because it is a non-autoregressive model. Overall, ST-GRAT is the second best model in terms of training time and shows average performance in terms of inference time. However, ST-GRAT is more robust than the other models in the impeded conditions (Table 3), which are more difficult to predict. Also, ST-GRAT provides interpretability (Fig. 5), which is an additional advantage over previous black-box models, such as GCNN-based models.

6 CONCLUSIONS
In this work, we presented ST-GRAT, a model with novel spatial and temporal attention for accurate traffic speed prediction. The spatial attention captures the spatial correlation among roads, utilizing graph structure information, while the temporal attention captures the temporal dynamics of the road network by directly attending to features in long sequences. ST-GRAT achieves state-of-the-art performance compared to existing methods on the METR-LA and PEMS-BAY datasets, especially in short-term forecasting when speeds change dynamically. Lastly, we visualize when and where ST-GRAT attends when making predictions during traffic congestion. As future work, we plan to conduct further experiments using ST-GRAT on different spatio-temporal domains and datasets, such as air quality data.

ACKNOWLEDGMENT
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT): No. 20200013360011001, Artificial Intelligence graduate school support (UNIST), and No. 2018-0-00219, Space-time complex artificial intelligence blue-green algae prediction technology based on direct-readable water quality complex sensor and hyperspectral image. Sungahn Ko (UNIST) and Jaegul Choo (KAIST) are the corresponding authors.

REFERENCES
[1] James Atwood and Donald F. Towsley. 2016. Diffusion-Convolutional Neural Networks. In Proc. the Advances in Neural Information Processing Systems (NIPS).
[2] Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv:1607.06450 (2016).
[3] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. In Proc. the Advances in Neural Information Processing Systems (NIPS).
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. Proc. the International Conference on Learning Representations (ICLR) (2014).
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. the Annual Meeting of the Association for Computational Linguistics (ACL).
[6] Xavier Glorot and Yoshua Bengio. 2010. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research), Vol. 9. PMLR, 249–256.
[7] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv:1606.08415 (2016).
[8] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
[9] Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. Proc. the International Conference on Learning Representations (ICLR) (2016).
[10] Chunggi Lee, Yeonjun Kim, Seungmin Jin, Dongmin Kim, Ross Maciejewski, David Ebert, and Sungahn Ko. 2019. A Visual Analytics System for Exploring, Monitoring, and Forecasting Road Traffic Congestion. IEEE Transactions on Visualization and Computer Graphics (2019).
[11] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. Proc. the International Conference on Learning Representations (ICLR) (2018).
[12] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A Structured Self-Attentive Sentence Embedding. Proc. the International Conference on Learning Representations (ICLR) (2017).
[13] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2016. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. Proc. the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
[14] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer Sentinel Mixture Models. Proc. the International Conference on Learning Representations (ICLR) (2017).
[15] Zheyi Pan, Yuxuan Liang, Junbo Zhang, Xiuwen Yi, Yingrui Yu, and Yu Zheng. 2018. HyperST-Net: Hypernetworks for Spatio-Temporal Forecasting. Proc. the Association for the Advancement of Artificial Intelligence (AAAI) (2018).
[16] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog (2019).
[17] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional Attention Flow for Machine Comprehension. Proc. the International Conference on Learning Representations (ICLR) (2016).
[18] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (JMLR) (2014).
[19] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale Information Network Embedding. In Proc. the International Conference on World Wide Web (WWW). 1067–1077.
[20] Charles Truong, Laurent Oudre, and Nicolas Vayatis. 2018. ruptures: Change Point Detection in Python. arXiv preprint arXiv:1801.00826 (2018).
[21] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio. In 9th ISCA Speech Synthesis Workshop. 125–125.
[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Proc. the Advances in Neural Information Processing Systems (NIPS).
[23] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lió, and Yoshua Bengio. 2018. Graph Attention Networks. Proc. the International Conference on Learning Representations (ICLR) (2018).
[24] Eleni I. Vlahogianni, Matthew G. Karlaftis, and John C. Golias. 2014. Short-term Traffic Forecasting: Where We Are and Where We're Going. Transportation Research Part C: Emerging Technologies 43, Part 1 (2014), 3–19.
[25] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Graph WaveNet for Deep Spatial-Temporal Graph Modeling. In Proc. the International Joint Conference on Artificial Intelligence (IJCAI).
[26] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan R. Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proc. the International Conference on Machine Learning (ICML).
[27] Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2018. Spatio-temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proc. the International Joint Conference on Artificial Intelligence (IJCAI).
[28] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. 2018. GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs. In Proc. the Conference on Uncertainty in Artificial Intelligence (UAI).
[29] Zheng Zhao, Weihai Chen, Xingming Wu, Peter C.Y. Chen, and Jingmeng Liu. 2017. LSTM Network: A Deep Learning Approach for Short-term Traffic Forecast. IET Intelligent Transport Systems 11, 2 (2017), 68–75.
[30] Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. 2019. GMAN: A Graph Multi-Attention Network for Traffic Prediction. Proc. the Association for the Advancement of Artificial Intelligence (AAAI) (2019).
[31] Eric Zivot and Jiahui Wang. 2006. Vector Autoregressive Models for Multivariate Time Series. Modeling Financial Time Series with S-PLUS® (2006).