MULTI-SOURCE INFORMATION BASED SHORT-TERM TAXI PICK-UP DEMAND PREDICTION USING DEEP-LEARNING APPROACHES
A Thesis Presented
By
Ziyan Chen
to
The Department of Civil and Environmental Engineering
in partial fulfillment of the requirements
for the degree of
Master of Science
In the field of
Transportation
Northeastern University
Boston, Massachusetts
May 2020
ACKNOWLEDGMENTS
I would like to express my immense gratitude to my advisors and friends who contributed
to this thesis. First of all, I would like to thank my thesis supervisor, Prof. Haris N.
Koutsopoulos, for his excellent guidance and unwavering support all the time. His
insightful advice greatly helped me during the research and writing of this thesis.
I am indebted to my instructor, Dr. Zhenliang (Mike) Ma. Mike is always patient,
enthusiastic, and caring toward his students. Beyond his valuable guidance on my academic
work and research, his understanding and support gave me the confidence to overcome
difficulties during the research journey.
Thanks also to the many people who indirectly contributed to this thesis: the
Ph.D. students and visiting students of Prof. Haris N. Koutsopoulos, as well as Bo Wang,
a Ph.D. student at Monash University.
Finally, I want to give my deepest gratitude and appreciation to my family. I am
lucky to have my parents, Yu Chen and Huiyun Jiang, who brought me up and taught me
honesty, kindness, and responsibility. Without their support, I would never have had the
chance to come from my homeland, China, to study at Northeastern University.
ABSTRACT
Short-term demand prediction is of great importance to on-demand ride-hailing services.
Predicted demand information can facilitate efficient operations and improve service
performance. This thesis proposes a multi-source information based spatiotemporal neural
network (MSI-STNN) deep learning architecture to predict short-term taxi pick-up demand.
It fuses pick-up and drop-off time-series data, weather information, and location popularity
data using three deep-learning models: a stacked convolutional long short-term memory
(Conv-LSTM) model, a stacked long short-term memory (LSTM) model, and a
convolutional neural network (CNN) model. The Conv-LSTM captures the spatiotemporal
features of the pick-up and drop-off time series, the LSTM extracts weather information, and
the CNN incorporates popularity data. A case study is performed to predict short-term pick-up
demand at the zonal level 15 minutes ahead using taxi data from Manhattan, New York. The
results validate the superiority of the proposed approach compared with state-of-the-art
time-series and deep learning approaches, including ARIMA, LSTM, and Conv-LSTM.
Key words: taxi pick-up demand, short-term prediction, deep learning, multi-source
information
TABLE OF CONTENTS
ABSTRACT .......... iii
1. INTRODUCTION .......... 1
1.1 Background .......... 1
1.2 Problem statement .......... 2
1.3 Motivation .......... 2
1.4 Thesis outline .......... 3
2. LITERATURE REVIEW .......... 4
2.1 Statistical models .......... 4
2.2 Machine learning methods .......... 6
2.3 Deep learning approaches .......... 7
2.4 Predictive models for taxi demand .......... 8
2.5 Summary .......... 9
3. DATA SOURCES .......... 10
3.1 Pick-up and drop-off requests .......... 10
3.2 Exogenous data .......... 10
3.2.1 Weather .......... 11
3.2.2 Location popularity .......... 11
4. DEEP LEARNING MODELS .......... 13
4.1 Convolutional neural networks (CNN) .......... 13
4.2 Recurrent neural networks (RNN) .......... 15
4.3 Long-short term memory networks (LSTM) .......... 16
4.4 Convolutional long-short term memory networks (Conv-LSTM) .......... 18
4.5 Summary .......... 19
5. METHODOLOGY .......... 21
5.1 Preliminaries .......... 21
5.1.1 Taxi zones and time partition .......... 22
5.1.2 Pick-up and drop-off demand .......... 22
5.1.3 Weather information .......... 22
5.1.4 Popularity .......... 23
5.1.5 Problem formulation .......... 24
5.2 Multi-source information based spatiotemporal neural network (MSI-STNN) .......... 24
5.2.1 Structure of spatial variables .......... 25
5.2.2 Structure of temporal variables .......... 25
5.2.3 Structure of spatiotemporal variables .......... 26
5.2.4 Information fusion .......... 27
5.2.5 Objective function .......... 27
6. CASE STUDY .......... 29
6.1 Study site and dataset .......... 29
6.2 Performance metrics .......... 32
6.2.1 Mean squared error (MSE) .......... 32
6.2.2 Theil’s inequality coefficient (TIC) .......... 32
6.3 Sensitivity analysis .......... 33
6.3.1 Epoch .......... 33
6.3.2 Optimizer .......... 35
6.3.3 Batch size .......... 37
6.3.4 Summary .......... 39
6.4 Performance diagnostics .......... 39
6.4.1 Comparison between baselines and MSI-STNN .......... 39
6.4.2 Evaluation using K-fold cross validation .......... 40
6.5 Data visualization .......... 42
6.5.1 Temporal visualization .......... 42
6.5.2 Spatial visualization .......... 43
7. CONCLUSION .......... 44
7.1 Contributions .......... 44
7.2 Future research .......... 45
REFERENCES ................................................................................................................. 51
LIST OF TABLES
Table 1 Comparison of basic deep learning models .......... 20
Table 2 Comparison of various optimizers .......... 36
Table 3 Performance comparison of various batch sizes .......... 38
Table 4 Performance comparison between the proposed model and baselines .......... 40
Table 5 Performance comparison between the proposed model and the model with K-fold cross-validation .......... 41
LIST OF FIGURES
Figure 1 Illustration of a typical CNN architecture .......... 14
Figure 2 Illustration of a typical RNN architecture (adopted from [47]) .......... 15
Figure 3 Illustration of a standard LSTM cell (adopted from [48]) .......... 17
Figure 4 Framework of the proposed MSI-STNN model .......... 25
Figure 5 Taxi zones in Manhattan .......... 30
Figure 6 Daily pick-up demand in January 2018 .......... 31
Figure 7 Spatial distribution of pick-up demand on January 26th, 2018 between 6 p.m. and 7 p.m. .......... 31
Figure 8 Initial training process with 100 epochs .......... 34
Figure 9 Framework of the modified MSI-STNN model .......... 35
Figure 10 Training process using different optimizers .......... 37
Figure 11 Training using different batch sizes .......... 38
Figure 12 K-fold cross-validation (K=4; red block: training set; blue block: validation set) .......... 41
Figure 13 Temporal prediction outcome for the first 1,000 timesteps .......... 42
Figure 14 Average error distribution across zones for all the timesteps .......... 43
Chapter 1
INTRODUCTION
1.1 Background
Taxi service is of great importance in urban transportation for its flexibility and
convenience. Traveling from one place to another is a necessary part of daily life, and
different modes of public transportation (e.g., trains, buses, taxis) make it easier to move
between areas. The taxi is an end-to-end travel mode. It is not the first choice for most
commuters, though, because of its cost, possible waiting time, and the chance of
encountering congestion. Almost every taxi company is interested in
exploring solutions to optimize their operations and balance demand and supply.
The new era of transportation is multidisciplinary and provides opportunities to
combine transportation with other fields such as Geographic Information System (GIS).
GIS significantly aids in planning, monitoring and managing complex systems involved in
transportation planning and management more efficiently. The use of GIS in transportation
is widespread: traffic modeling, accident analysis, route planning and highway
maintenance. Thanks to GPS technologies, location data (e.g., for taxis) can be
captured. For instance, a taxicab can be tracked with multiple attributes, including its real-
time location, unique ID, trip distance, and even the fare amount.
In recent years, fast-growing smartphone apps have been changing the way people
travel in urban areas. Instead of hailing a taxi on the street, people can make a
reservation on their smartphones. Once the driver receives a request from a passenger, he/she can
receive the specific location of this passenger and choose the optimal route. This interaction
between drivers and passengers reduces waiting time and fuel costs, which benefits both
sides.
With the availability of huge amounts of taxi data, the opportunity exists to develop
methods to predict, in real time, future requests, a capability that can enhance the ability to
schedule service in more efficient and sustainable ways.
1.2 Problem statement
Predictive data analytics facilitates proactive decision support for both operators and
travelers in transportation. The demand forecasting problem has been widely studied in the
last decades, and a vast amount of methods exist, from model-based time series analysis to
machine learning and model-free deep learning techniques.
Traditional approaches focus on applying complex mathematical models to predict
travel demand. However, they are often less efficient and less accurate than state-of-the-art
machine learning and deep learning models. Moreover, traditional time series analysis is not capable
of dealing with the large-scale datasets that are becoming available.
Taxi demand requests in urban areas vary by zone, population density, weather
conditions, and so on, which makes forecasting complicated. This thesis aims
to predict real-time zonal taxi demand (e.g., 15 minutes ahead) by fusing
multi-source information.
1.3 Motivation
Machine learning and deep learning techniques are popular in the latest research. In this
thesis, the goal is to develop a multi-source information based spatiotemporal neural
network (MSI-STNN) approach to fuse pick-up, drop-off and exogenous variables
simultaneously. To be specific, pick-up and drop-off data can be mined from available taxi
datasets and combined with exogenous variables, such as weather and area popularity data.
After preparing the data of pick-up demand, drop-off demand, weather conditions and
popularity, a stacked convolutional long short-term memory (Conv-LSTM) model is
applied to extract the spatiotemporal features of pick-up and drop-off demand
simultaneously rather than combining long short-term memory (LSTM) and convolutional
neural network (CNN) to acquire spatial and temporal characteristics separately. The
stacked LSTM is adopted to predict weather conditions, and CNN is used to extract zonal
popularity features among zones.
The objective of this novel model structure is to forecast the taxi pick-up demand
in the next 15 minutes across all taxi zones, given the multi-source data from the previous hour.
Our model was trained and tested with taxi data in Manhattan from 2017 to 2018.
Specifically, the training set was from the whole year of 2017, while the testing set was the
first three months of 2018. Hyperparameters of the model were fine-tuned to improve
accuracy. The proposed model results were compared with baselines under different
conditions.
1.4 Thesis outline
This thesis is organized as follows:
Chapter 2 reviews statistical models and their application to the taxi demand
prediction problem. Machine learning methods are introduced in section 2.2, and state-of-
the-art deep learning approaches are reviewed and compared in section 2.3. Moreover, previous
predictive models for taxi demand are described in section 2.4.
Chapter 3 introduces four data sources with different information. The taxi pick-up
and drop-off requests are described in section 3.1. Exogenous data, used to improve the
performance of our model, are discussed in section 3.2. Multiple deep learning approaches
are used to extract information based on the characteristics of different data sources.
In chapter 4, deep learning approaches for demand forecasting are presented. There
are many variants of deep learning models. In this chapter, we focus on the basic ones and
specifically their use to the demand forecasting problem. Convolutional neural networks
(CNN), recurrent neural networks (RNN), long-short term memory networks (LSTM) and
convolutional long-short term memory networks (Conv-LSTM) are discussed.
In chapter 5, a novel deep learning model is proposed. We first pre-process the raw
datasets to prepare the inputs of our model. Then, we build a four-layer deep learning model
in section 5.2, including two Conv-LSTM models for pick-up and drop-off requests, an
LSTM model for weather data, and a CNN model for location popularity.
Chapter 6 introduces a case study in which the model is evaluated with the taxi
pick-up requests in Manhattan. The chapter presents the study area and dataset in section
6.1, performance metrics in section 6.2, sensitivity analysis in section 6.3, performance
diagnostics in section 6.4 and data visualization of pick-up requests in section 6.5.
Chapter 7 concludes the thesis.
Chapter 2
LITERATURE REVIEW
This chapter reviews the evolution of time series modeling. In section 2.1, we first discuss
the basics of statistical models, since time series analysis is essentially a statistical
technique. Over the past decades, machine learning methods came to be considered
fundamental methodologies for time series forecasting, as introduced in
section 2.2. Recently, state-of-the-art deep learning approaches have become mainstream in
data-driven analysis. In section 2.3, some basic deep learning models for time series
analysis are reviewed.
2.1 Statistical models
Statistical modeling is a simplified, mathematically formalized way to approximate reality
(i.e. what generates your data) and optionally to make predictions from this approximation.
Its application initially started in physics and is now being applied in finance, engineering,
social sciences, etc.
Time series is one of the techniques under statistical analysis, which is typically
measured over successive times, representing a sequence of data points [1]. The
measurements taken during an event in a time series are arranged in proper chronological
order. Basically, there are two types of time series: continuous and discrete. In a
continuous-time series, observations are measured at every instance of time, whereas a
discrete-time series contains observations measured at discrete points of time. Usually, in
a discrete-time series, the consecutive observations are recorded at equally spaced time
intervals such as hourly, daily, monthly, or yearly time separations. In general, to do further
analysis, the data observed in a discrete-time series is assumed to be a continuous
variable on the real number scale [2].
The procedure of fitting a time series to a proper model is referred to as time
series analysis. In practice,
the parameters of the model are estimated from the known data values, yielding models
that attempt to analyze and understand the nature of the series. These models are
useful for simulation and forecasting after being validated. A time series, in general,
is assumed to be affected by four main components: trend, cyclical, seasonal and irregular
components [3]. These four components can be extracted and separated from the observed
data. Considering the effects of these four components, in general, additive and
multiplicative models are used for a time series decomposition. An additive model is based
on the assumption that the four components of a time series are independent of each other.
However, in a multiplicative model, the four components can affect the others, meaning
they are not necessarily independent.
Traditional time series analysis leverages statistical models, which predict
future values given successive historical records. For example, Bayesian
forecasting [4], the autoregressive integrated moving average (ARIMA) model [5], and
the Kalman filter [6] are among the most classic ones.
A Bayesian Forecasting model is essentially a dynamic linear model. The Bayesian
approach, in general, requires the explicit formulation of a model, and conditioning on
known quantities, in order to draw inferences about unknown ones. In Bayesian forecasting,
one simply takes a subset of the unknown quantities to be future values of some variables
of interest.
The autoregressive integrated moving average (ARIMA) model is a generalization
of the simpler autoregressive moving average (ARMA) [7]. It is a form
of regression analysis that gauges the strength of one dependent variable relative to other
changing variables. The model's goal is to predict future values by examining the
differences between values in the series rather than the actual values. Both
seasonal and non-seasonal ARIMA models can be used for forecasting.
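To make the differencing ("I") and autoregressive ("AR") components concrete, here is a minimal NumPy sketch of an ARIMA(p, 1, 0)-style forecaster fitted by ordinary least squares. The synthetic series and parameter choices are illustrative only, not from the thesis; production work would typically use a library such as statsmodels.

```python
import numpy as np

def fit_ar(y, p):
    """Fit an AR(p) model y_t = c + sum_i phi_i * y_{t-i} by least squares."""
    X = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [c, phi_1, ..., phi_p]

def forecast_arima_p10(series, p, steps):
    """ARIMA(p,1,0)-style forecast: difference once, fit AR(p), integrate back."""
    diff = np.diff(series)
    coef = fit_ar(diff, p)
    hist = list(diff)
    level = series[-1]
    out = []
    for _ in range(steps):
        lags = hist[-p:][::-1]                # most recent lag first
        d = coef[0] + np.dot(coef[1:], lags)  # predicted next difference
        hist.append(d)
        level += d                            # undo the differencing
        out.append(level)
    return np.array(out)

# Synthetic demand-like series with an upward drift plus noise
rng = np.random.default_rng(0)
y = np.cumsum(1.0 + 0.1 * rng.standard_normal(200))
preds = forecast_arima_p10(y, p=3, steps=4)   # e.g. next four 15-min intervals
```

Differencing removes the drift so the AR part works on a near-stationary series; the forecast then re-accumulates the predicted differences onto the last observed level.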
In statistics and control theory, Kalman filtering, also known as linear quadratic
estimation (LQE), is an algorithm that uses a series of measurements observed over time,
containing statistical noise and other inaccuracies, and produces estimates of unknown
variables that tend to be more accurate than those based on a single measurement alone, by
estimating a joint probability distribution over the variables for each timeframe.
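A minimal scalar sketch of the Kalman predict/update cycle, assuming a random-walk state observed with noise (all variances here are illustrative values, not from the thesis):

```python
import numpy as np

def kalman_1d(measurements, q=1e-3, r=0.25):
    """Scalar Kalman filter for a random-walk state x_t = x_{t-1} + w_t,
    observed as z_t = x_t + v_t, with process variance q and noise variance r."""
    x, p = measurements[0], 1.0   # initial state estimate and its variance
    estimates = []
    for z in measurements:
        # Predict: the random-walk model keeps x, but uncertainty grows by q
        p = p + q
        # Update: blend the prediction with the measurement via the Kalman gain
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return np.array(estimates)

rng = np.random.default_rng(1)
truth = np.cumsum(0.05 * rng.standard_normal(100)) + 10.0  # slowly drifting state
noisy = truth + 0.5 * rng.standard_normal(100)             # noisy observations
smooth = kalman_1d(noisy)
```

Because the estimate combines all past measurements, its error is smaller than that of any single noisy observation, which is exactly the property described above.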
These algorithms have been used in various applications, such as in supply chains
[8] and traffic flow prediction [9-10]. However, these algorithms rely on a specific
mathematical model of the underlying process, which is assumed to be linear with
Gaussian noise. This limitation is inconsistent with the complicated characteristics of
data from various sources. Typically, these statistical models have proved efficient
for simple stationary time series problems, which do not take multi-source data
into account.
2.2 Machine learning methods
Machine learning methods have been widely applied to demand forecasting in various domains
over the last decade. Examples include support vector machines, decision
trees, and others [11]. Some of these methods are based on classic statistical approaches
[12].
Methods such as artificial neural networks [13] and random forests [14] have made
great contributions to the analysis of time series in transportation. For example, an
improved K-nearest neighbors method [15] and optimized artificial neural networks (ANN) [16]
have been adopted in traffic flow prediction, with accuracies of roughly 90 percent.
Artificial neural networks (ANN) approaches have been suggested for time series
forecasting and gained popularity in the last few years. ANNs were built on a model of the
human brain [17]. Although the development of ANNs was mainly biologically motivated,
they have been applied in various areas, especially for classification and forecasting
purposes [7]. ANNs try to recognize essential patterns and regularities in the input data,
learn from experience and then provide generalized results based on the previous
knowledge. Among ANN models, the most widely used in forecasting problems is the multi-
layer perceptron, which uses a single-hidden-layer feed-forward network [18]. The model
is characterized by a network of three layers connected by acyclic links: the input, hidden, and
output layers.
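The three-layer structure can be sketched as a NumPy forward pass (the weights here are random for illustration only; a real model would learn them by backpropagation):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Single-hidden-layer feed-forward network: input -> hidden (tanh) -> output."""
    h = np.tanh(x @ W1 + b1)   # hidden layer activations
    return h @ W2 + b2         # linear output layer (regression)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 1   # e.g. 4 lagged demand values -> 1 prediction
W1 = rng.standard_normal((n_in, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_out)) * 0.1
b2 = np.zeros(n_out)

batch = rng.standard_normal((5, n_in))   # five input windows
y_hat = mlp_forward(batch, W1, b1, W2, b2)
```

The acyclic (feed-forward) structure is visible in the code: information flows strictly from input to hidden to output, with no recurrent connections.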
A major breakthrough in the area of time series forecasting occurred with the
development of support vector machines (SVM) [19-20]. The initial aim of SVM was to
solve pattern classification problems, but SVMs have since been applied in many other fields,
such as function estimation, regression, and time-series prediction [21]. A
characteristic of SVM is that it aims at a better generalization of the training data. With
SVMs, instead of depending on the whole data set, the solution usually only depends on a
subset of the training data points, called the support vectors [22]. Furthermore, with the
help of support vector kernels, the input points in SVM applications are usually mapped to
a high dimensional feature space, which often generates good generalization outcomes. For
this reason, the SVM methodology has become a technique used for time series forecasting
problems.
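As a sketch of SVM-based time-series forecasting, the series can be windowed into (lagged values, next value) pairs and fed to a support vector regressor. This example assumes scikit-learn's SVR is available; the synthetic series, window length, and kernel settings are illustrative choices, not from the thesis.

```python
import numpy as np
from sklearn.svm import SVR

def make_windows(series, window):
    """Turn a 1-D series into (lagged-window, next-value) training pairs."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Synthetic daily-cycle demand signal with noise
rng = np.random.default_rng(0)
t = np.arange(300)
series = np.sin(2 * np.pi * t / 24) + 0.05 * rng.standard_normal(300)

X, y = make_windows(series, window=8)
# RBF kernel maps the lag windows into a high-dimensional feature space
model = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X[:250], y[:250])
preds = model.predict(X[250:])   # one-step-ahead predictions on held-out windows
```

Only the support vectors (a subset of the 250 training windows) determine the fitted function, which is the sparsity property described above.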
The outcome of machine learning algorithms depends on the input, i.e., feature
selection is necessary. Furthermore, hyperparameters should be manually calibrated to
yield the best prediction. Another shortcoming of machine learning models is their
computational requirements.
2.3 Deep learning approaches
Deep learning is a subfield of machine learning concerned with algorithms based on
artificial neural networks (ANN). Essentially, deep learning
models are extensions of ANNs with multiple hidden layers.
Recently, advances in deep learning algorithms and their implementations have been
applied in transportation, especially in demand forecasting. Studies have shown the
advantages of recurrent neural network (RNN) and its variants, i.e., long short-term
memory (LSTM) [23] and gated recurrent units (GRU) [24], in time series analysis. [25]
employed an LSTM network to capture temporal
characteristics with optimal time lags for traffic speed prediction. [26] applied LSTM to
describe temporal relations for traffic state prediction, and it also added an autoencoder to
deal with extreme situations, e.g., peak hour and traffic accidents. [27] considered sharp
nonlinearities, e.g., transitions, breakdown, recovery and congestion, as major effects in
the performance of traffic flow prediction and, based on this, proposed a combined
ℓ1 regularization and a sequence of tanh layers to capture the nonlinearities. However, the
models mentioned above fail to consider spatial characteristics, which are an endogenous
dependency in zone-based demand forecasting.
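The gating mechanism that lets an LSTM retain long-range temporal information can be sketched as a single NumPy cell step (weights are random for illustration; a trained network would learn W, U, and b):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,); gates stacked as [i, f, o, g]."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])          # input gate: how much new info to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much cell state to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 3, 5                       # input size (e.g. demand, temp, rain), hidden size
W = rng.standard_normal((4 * H, D)) * 0.1
U = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((10, D)):   # run the cell over a 10-step sequence
    h, c = lstm_step(x, h, c, W, U, b)
```

The additive cell update `c_new = f * c + i * g` is what mitigates vanishing gradients and allows long time lags, but note that nothing here is spatial: the state is a flat vector, which is exactly the limitation noted above.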
CNNs have shown great potential in computer vision, and various papers employ CNNs for
traffic prediction. Typically, the area is treated as an image, and local features are extracted
using a CNN. [28] proposed a CNN-based architecture to predict traffic speed. In that paper,
the authors treat traffic as images: a CNN is applied to extract the features of vehicle trajectories
in each road segment, from which vehicle speed can be estimated.
These methods do not pay much attention to temporal correlations since they
simply fuse the data extracted by CNN. Later, researchers found a suitable solution to
combine spatial and temporal characteristics together, i.e., convolutional long short-term
memory models (Conv-LSTM), first introduced in [29]. For example, [30] employed
Conv-LSTM to handle travel time and demand intensity. [31] applied Conv-LSTM to
extract the spatiotemporal features of crash risk and taxi trips. The experimental outcomes
indicate that Conv-LSTM is more reliable and efficient when dealing with data that
has both temporal dependencies and spatial discrepancies. Nevertheless, this method has not
yet been widely adopted in transportation applications.
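The key idea can be sketched in NumPy for a single-channel grid: the four LSTM gate pre-activations are computed with 2-D convolutions over the zone grid instead of dense matrix products (kernels are random for illustration; a trained Conv-LSTM would learn them):

```python
import numpy as np

def conv2d_same(grid, kernel):
    """'Same'-padded 2-D sliding-window product of a (H, W) grid with a (k, k)
    kernel (cross-correlation, as in deep-learning 'convolutions')."""
    k = kernel.shape[0]
    pad = k // 2
    g = np.pad(grid, pad)
    H, W = grid.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(g[i:i + k, j:j + k] * kernel)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, Wx, Wh, b):
    """One Conv-LSTM step on a single-channel grid: the four LSTM gate
    pre-activations use convolutions instead of matrix multiplications."""
    gates = [conv2d_same(x, Wx[g]) + conv2d_same(h, Wh[g]) + b[g] for g in range(4)]
    i, f, o = sigmoid(gates[0]), sigmoid(gates[1]), sigmoid(gates[2])
    g = np.tanh(gates[3])
    c_new = f * c + i * g          # cell state is itself a spatial grid
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
Wx = rng.standard_normal((4, 3, 3)) * 0.1   # input-to-state 3x3 kernels
Wh = rng.standard_normal((4, 3, 3)) * 0.1   # state-to-state 3x3 kernels
b = np.zeros(4)

h = np.zeros((6, 6))                        # hidden state on a 6x6 zone grid
c = np.zeros((6, 6))
for x in rng.standard_normal((4, 6, 6)):    # four time steps of demand "images"
    h, c = convlstm_step(x, h, c, Wx, Wh, b)
```

Because the hidden and cell states are grids and the transitions are convolutions, each zone's state depends on its spatial neighbors at every time step, which is why Conv-LSTM captures spatial and temporal structure jointly rather than separately.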
2.4 Predictive models for taxi demand
In the last decade, GPS-location systems have attracted the attention of both researchers
and companies due to the new type of information they provide. Specifically, location-
aware sensors and information transmitted can be used to track human mobility patterns.
Rail [32], bus [33] and taxi [34] applications are already successfully exploring these traces.
Gonzalez et al. [35] uncovered the spatiotemporal regularity of human mobility. Similar
patterns can be found in other activities such as electricity load [36] and freeway traffic
[37].
In recent years, Uber and Lyft have become two popular ride-hailing companies that use
location-based services to reduce waiting times for both passengers and drivers. For
these services, the imbalance between demand and supply can reduce profits and level of
service. Wong presented a relevant mathematical model to express this need for
equilibrium in distinct contexts [38]. Lack of equilibrium may result in one of two scenarios:
(1) an excess of vacant vehicles and excessive competition; or (2) longer waiting times for
passengers and lower service reliability [39].
Knowledge about where the demand will actually emerge can be an advantage for
the drivers. Historical GPS data is central to this topic because it can
reveal underlying mobility patterns. Such data represents a new opportunity to learn
relevant patterns while the network is operating in real time.
Several researchers have already explored this data successfully with distinct
applications like modeling the spatiotemporal structure of taxi services [40], smart driving
[41] and building intelligent passenger-finding strategies [42]. Despite their useful insights,
the majority of the reported techniques are based on offline tests, discarding some of the main
advantages of real-time information. In other words, they do not provide online
information about expected future demand or the best places to pick up passengers in real time. One
of the recent advances on this topic was presented by Moreira-Matias et al. [43], where a discrete-
time series framework is proposed to forecast service demand. This framework handles
three distinct types of memory range: short term, midterm, and long term [44-45].
2.5 Summary
In this chapter, we have reviewed different approaches in time series modeling and demand
forecasting, including statistical, machine learning, and state-of-the-art deep learning models.
Traditional statistical models have been widely used in various domains. Machine learning
methods are considered as a more accurate and reliable way for demand forecasting, in
which artificial neural networks (ANN) and support vector machines (SVM) are among
the most popular ones. In the era of big data, however, neither statistical nor classic machine
learning models scale well to large datasets. With the help of GPU acceleration, deep learning
approaches have become popular for handling complicated time series problems. By applying
these forecasting techniques to predict taxi demand, operating costs, passenger waiting
times, and the number of idle taxis can be reduced. However, most studies have used only
time-series demand data for prediction without considering important exogenous variables,
such as weather or location popularity, etc. It is not trivial to fuse multi-source information
to make a real-time prediction using deep learning techniques.
Chapter 3
DATA SOURCES
In this chapter, different data sources for our model are introduced. In this thesis, we focus
on predicting the pick-up demand in the next 15 minutes given time-series information from the
previous hour. Many features could be considered to build a good model.
Guided by intuition and knowledge of transportation demand modeling, we leverage
several factors that are very likely to affect taxi demand: taxi pick-up requests, taxi
drop-off requests, weather, and popularity.
3.1 Pick-up and drop-off requests
With the help of GPS-location services, companies such as Uber and Lyft have
introduced new mobility services through which customers can request a taxi on a smartphone
instead of hailing one on the street. Once a passenger successfully finds a taxi, the trip
information, including time and location, is recorded. These real-time taxi trip records are
the key to building a time series demand forecasting model.
The area of interest can be divided into zones. We first aggregate the dataset by
time intervals (i.e., 15 minutes); the demand in a zone in each time interval
is then obtained by summing up the requests from that zone during that interval. The
dataset is then divided into pick-up and drop-off datasets.
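The aggregation step above can be sketched with pandas; the column names (`pickup_datetime`, `pickup_zone`) are hypothetical placeholders for the trip-record fields:

```python
# Sketch of the aggregation step, assuming a pandas DataFrame of trip records
# with hypothetical columns "pickup_datetime" and "pickup_zone".
import pandas as pd

def aggregate_demand(trips: pd.DataFrame, freq: str = "15min") -> pd.DataFrame:
    """Count pick-up requests per (time interval, zone)."""
    trips = trips.copy()
    trips["interval"] = trips["pickup_datetime"].dt.floor(freq)
    demand = (trips.groupby(["interval", "pickup_zone"])
                   .size()
                   .unstack(fill_value=0))  # rows: intervals, columns: zones
    return demand

trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2017-01-01 08:03", "2017-01-01 08:14", "2017-01-01 08:20",
        "2017-01-01 08:29", "2017-01-01 08:31"]),
    "pickup_zone": [1, 1, 2, 1, 2],
})
demand = aggregate_demand(trips)
# The 08:00-08:15 interval has 2 requests in zone 1.
```

The same function applied to the drop-off columns yields the drop-off dataset.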
3.2 Exogenous data
Traditional time series analysis pays much attention to temporal characteristics. But as
is shown in Figure 4, both temporal and spatial characteristics should be
taken into account in demand forecasting. In this thesis, pick-up and drop-off requests
contain both temporal (i.e., travel time) and spatial (i.e., location ID) characteristics. In order
to get a more accurate prediction, we consider exogenous data (i.e., popularity and
weather) that may impact demand.
3.2.1 Weather
Weather data can be collected from various sources. For example, in the USA, the National
Oceanic and Atmospheric Administration (NOAA) website provides hourly aggregated
weather information. The information includes hourly maximum temperature, minimum
temperature, precipitation, average wind speed and snowfall.
Temperature is the most straightforward factor. It has strong temporal
characteristics: the maximum and minimum temperatures during a day generally occur
in the afternoon and around midnight, respectively. Moreover, temperature is seasonal;
typically, the average temperature in winter is much lower than in the
summertime.
Precipitation is another aspect of weather. Wind and snow are two factors
that affect people's activities in multiple ways. For instance, people may prefer public
transportation during a snowstorm, or stay at home when there are
intense winds.
Overall, weather information is of great importance in demand forecasting,
especially in some extreme cases (e.g. snowstorm and rainstorm).
3.2.2 Location popularity
Popularity is another exogenous variable that can be used in demand forecasting. Generally,
location popularity is a collective perception: it emerges when a group of people holds a
shared view of a place. Different groups may view a location positively or negatively, but
the more attention a location gets, the more popular it will be.
Location popularity can be numerically represented by the number of reviews of
points of interest (POIs). If a POI receives numerous reviews from visitors, the
site is popular. So, we can use the number of POI reviews as a spatial
feature for taxi demand forecasting, i.e., the POIs with more reviews in a region
carry higher weight when predicting the demand in that zone.
In this thesis, we utilize the Yelp Fusion API to extract the number of historical
reviews at each point of interest. In Manhattan, for example, there are 11,760 POIs that
cover most of the businesses there. We then merge the review counts in each taxi zone to
obtain the average, median, maximum, minimum and standard deviation values to represent the
popularity of each zone. We assume that popularity is stationary during the analysis period,
which implies that popularity is a purely zonal feature.
Chapter 4
DEEP LEARNING MODELS
Deep learning approaches have been used to build accurate demand forecasting models.
Deep learning is essentially part of machine learning and is based on neural networks. The
network is often treated as a 'black box', with the focus on the input and output layers. Deep learning
models perform better on large datasets than traditional machine
learning approaches. In this chapter, we introduce four popular deep learning models:
convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term
memory networks (LSTM) and convolutional long short-term memory networks (Conv-
LSTM). We do not dig into the details and variations of each model; instead, we focus on
the basic version of each model and discuss its pros and cons.
4.1 Convolutional neural networks (CNN)
Convolutional neural networks (CNN) are a specialized type of neural network that has
proven effective in areas such as image recognition and classification. CNNs have been
successful in identifying faces, objects, and traffic signs, apart from powering vision in
robots and self-driving cars.
Central to the convolutional neural network is the convolutional layer that gives the
network its name. This layer performs an operation called a "convolution". In the context
of a convolutional neural network, convolution is a linear operation that involves the
multiplication of a set of weights with the input, much like a traditional neural network.
Given that the technique was designed for two-dimensional input, the multiplication is
performed between an array of input data and a two-dimensional array of weights, called a
filter or a kernel.
The filter is smaller than the input data and the type of multiplication applied
between a filter-sized patch of the input and the filter is a dot product. A dot product is an
element-wise multiplication between the filter-sized patch of the input and filter, which is
then summed, always resulting in a single value. Because it results in a single value, the
operation is often referred to as the "scalar product". Using a filter smaller than the input
is intentional as it allows the same filter (set of weights) to be multiplied by the input array
multiple times at different points on the input. Specifically, the filter is applied
systematically to each overlapping part or filter-sized patch of the input data, left to right,
top to bottom.
A typical CNN architecture is shown in Fig. 1. Unlike a fully connected neural
network, in which the hidden activation H is computed by multiplying the entire input V
by weights W, CNNs leverage convolution kernels to multiply a small local input (i.e., [v1,
v2, v3]) by the weights W. Then the kernel moves to the next local input (i.e., [v2, v3, v4]),
which means the kernel is fixed and the weights W are shared across the entire input V.
After computing the hidden units, a maxpooling layer with filters of a given pooling size
(e.g., 2 × 2) outputs the maximum activation in each filter, which means every MAX
operation discards 75% of the activations if the filter size is 2 × 2. Compared with fully
connected neural networks, CNNs progressively reduce the number of parameters and
help avoid overfitting.
Figure 1 Illustration of a typical CNN architecture
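The sliding dot product and the MAX operation described above can be sketched in NumPy; this is an illustrative toy, not the thesis code:

```python
# A minimal NumPy sketch of the two operations described above: a filter
# slid over the input as repeated dot products, followed by 2x2 max pooling.
import numpy as np

def conv2d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid convolution: dot product of the kernel with each patch."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool2d(x: np.ndarray, size: int = 2) -> np.ndarray:
    """Keep the maximum of each non-overlapping size x size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
feat = conv2d(x, np.ones((3, 3)))   # shape (2, 2): the filter fits 4 times
pooled = maxpool2d(x)               # keeps 1 of every 4 activations (75% discarded)
```

Note how `maxpool2d` returns a 2 × 2 array from a 4 × 4 input, discarding three quarters of the activations exactly as stated above.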
4.2 Recurrent neural networks (RNN)
Recurrent neural networks are another specialized type of neural networks where
the output from the previous step are fed as input to the current step. In traditional neural
networks, all the inputs and outputs are independent, but in the case when predicting the
next word of a sentence, the previous words are required, and hence there is a need to
remember them. RNN solve this issue with the help of a Hidden Layer. The main and most
important feature of RNN is the Hidden state, which remembers some information about a
sequence.
An RNN has a "memory" which retains information about what has been
calculated so far. It uses the same parameters for each input, as it performs the same task on all
the inputs and hidden layers to produce the output. This reduces the number of
parameters, unlike other neural networks. Fig. 2 shows the basic RNN structure and its
unrolled version. At a particular timestep t, X(t) is the input to the network and h(t) is the
output of the network. A is the RNN cell, which contains neural networks just like a
feedforward net.
Figure 2 Illustration of a typical RNN architecture (adopted from [47])
First, the RNN takes X(0) from the input sequence and outputs h(0), which
together with X(1) is the input for the next step. Next, h(1) is the input
together with X(2) for the following step, and so on. With this recursive structure, the RNN keeps
remembering the context while training. We can define the values of the hidden units using Eq.
(1):
h_t = φ(W · X_t + U · h_{t-1})    (1)
where h_t is the hidden state at timestamp t, φ is the activation function (either tanh or
sigmoid), W is the weight matrix from the input to the hidden layer, X_t is the input
at timestamp t, U is the weight matrix from the hidden layer at time t-1 to the hidden layer at
time t, and h_{t-1} is the hidden state at timestamp t-1.
RNNs learn the weights U and W through training using back propagation. These
weights decide the importance of the hidden state of the previous timestamp and the
importance of the current input; essentially, they determine how much of the previous hidden
state and of the current input is used to generate the current output. The activation
function φ adds non-linearity to the RNN, allowing it to model non-linear relationships
in the data.
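Eq. (1) amounts to a single line of code per step. A minimal NumPy sketch of a few recurrent steps, with tanh chosen as the activation and the bias omitted as in Eq. (1):

```python
# A single RNN step following Eq. (1): h_t = tanh(W . x_t + U . h_{t-1}).
# The dimensions are illustrative.
import numpy as np

def rnn_step(x_t, h_prev, W, U):
    return np.tanh(W @ x_t + U @ h_prev)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))   # input-to-hidden weights
U = rng.normal(size=(3, 3))   # hidden-to-hidden weights
h = np.zeros(3)
for x_t in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    h = rnn_step(x_t, h, W, U)  # the same W, U are reused at every step
```

Reusing the same `W` and `U` at every step is exactly the parameter sharing described above.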
4.3 Long-short term memory networks (LSTM)
Long-short term memory networks (LSTM) are actually a special recurrent neural network
(RNN) architecture. Sometimes we only focus on recent information to perform the present
task. For example, we consider building a language model to predict the next word based
on previous ones in a sentence. If we want to know the last word in the sentence of “The
clouds are in the --.”, we do not need much information but the word ‘clouds’. It is pretty
obvious that the last word is going to be ‘sky’. In such cases, RNNs can deal with such
prediction tasks very well.
Unfortunately, RNNs are not always capable of handling long-term dependencies,
which LSTMs handle well. Consider trying to predict the last word in the text "I grew up
in China……I speak fluent --." The recent context indicates that the last word is probably
the name of a language. But we need the context of 'China' from further back
to predict 'Chinese' as the last word. LSTMs are explicitly designed to handle
this long-term dependency problem.
LSTMs also have this chain-like structure, but the LSTM cell is quite special. In
standard RNNs, each cell has only a single neural network layer, while there are four in
each LSTM cell (as shown in Fig. 3). As demonstrated in Eqs. (2)-(7), f_t, i_t, C_t and o_t represent
the forget gate, input gate, memory cell, and output gate, respectively, sharing the same
dimension as h_t.
Figure 3 Illustration of a standard LSTM cell (adopted from [48])
f_t = σ(W_f · [h_{t-1}, x_t] + W_cf ⊙ C_{t-1} + b_f)    (2)

i_t = σ(W_i · [h_{t-1}, x_t] + W_ci ⊙ C_{t-1} + b_i)    (3)

C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)    (4)

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t    (5)

o_t = σ(W_o · [h_{t-1}, x_t] + W_co ⊙ C_{t-1} + b_o)    (6)

h_t = o_t ⊙ tanh(C_t)    (7)
W_f, W_i, W_c, W_o, W_cf, W_ci, W_co are weight matrices which conduct a linear
transformation, while b_f, b_i, b_c, b_o are bias parameters. It is noteworthy that C_t
acts as an accumulator of the state information. Every time a new
input arrives, its information is accumulated into the memory cell C_t once the input
gate i_t is activated. Also, the past cell state C_{t-1} can be forgotten if the forget gate f_t is
on. Whether the latest cell state C_t is propagated to the final state is further
determined by the output gate o_t. The operator ⊙ denotes the Hadamard product, i.e.,
element-wise multiplication. σ and tanh are two non-linear
activation functions given by:
σ(x) = 1 / (1 + e^{-x})    (8)

tanh(x) = (e^x − e^{-x}) / (e^x + e^{-x})    (9)
LSTMs are more effective than standard RNNs for taxi demand forecasting
because taxi demand exhibits strong patterns across the time of day and the day of the
week. For example, taxi demand has a morning peak and an evening peak during the
day. If the model remembers the morning peak from a previous day, we
may get a better prediction outcome for the next morning peak, even if we only have inputs
of historical records from the previous hour.
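Eqs. (2)-(7) translate directly into code. A NumPy sketch of one step of this (peephole-style) LSTM cell, with illustrative dimensions and randomly initialized parameters:

```python
# A direct NumPy transcription of Eqs. (2)-(7); dimensions are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    z = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    f = sigmoid(p["Wf"] @ z + p["Wcf"] * C_prev + p["bf"])   # Eq. (2)
    i = sigmoid(p["Wi"] @ z + p["Wci"] * C_prev + p["bi"])   # Eq. (3)
    C_tilde = np.tanh(p["Wc"] @ z + p["bc"])                 # Eq. (4)
    C = f * C_prev + i * C_tilde                             # Eq. (5)
    o = sigmoid(p["Wo"] @ z + p["Wco"] * C_prev + p["bo"])   # Eq. (6)
    h = o * np.tanh(C)                                       # Eq. (7)
    return h, C

n_h, n_x = 4, 3
rng = np.random.default_rng(1)
p = {k: rng.normal(scale=0.1, size=(n_h, n_h + n_x))
     for k in ("Wf", "Wi", "Wc", "Wo")}
p.update({k: rng.normal(scale=0.1, size=n_h)
          for k in ("Wcf", "Wci", "Wco", "bf", "bi", "bc", "bo")})
h, C = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), p)
```

The Hadamard products `f * C_prev` and `i * C_tilde` are the ⊙ operations of Eq. (5); the cell state `C` carries information forward across steps.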
4.4 Convolutional long-short term memory networks (Conv-LSTM)
Although the LSTM layer has been proven to be quite suitable to handle data with temporal
characteristics, it lacks the ability to extract spatial information. To address this problem,
we introduce the convolutional long-short term memory network (Conv-LSTM), which is
an extension of LSTM.
In order to capture spatial information, all elements in the Conv-LSTM, including
the input, output, hidden state, memory cell, input gate, output gate, and forget gate, are
resized to 3D tensors whose last two dimensions (i.e., height and width) represent the
spatial characteristics. We assume that the input is an "image" at each timestep. The
spatiotemporal information of the image then flows through the Conv-LSTM cells. The future output is
determined by both the outputs and inputs of previous timesteps. This is achieved
by applying the convolution operator in the state-to-state and input-to-state transitions. So,
we just need to replace the matrix product operator ("·") in the LSTM with the convolution
operator ("∗"). The key equations are given in Eqs. (10)-(15) below:
f_t = σ(W_f ∗ [h_{t-1}, x_t] + W_cf ⊙ C_{t-1} + b_f)    (10)

i_t = σ(W_i ∗ [h_{t-1}, x_t] + W_ci ⊙ C_{t-1} + b_i)    (11)

C̃_t = tanh(W_c ∗ [h_{t-1}, x_t] + b_c)    (12)

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t    (13)

o_t = σ(W_o ∗ [h_{t-1}, x_t] + W_co ⊙ C_{t-1} + b_o)    (14)

h_t = o_t ⊙ tanh(C_t)    (15)
The forget gate, input gate, memory cell, output gate, hidden state, and input tensors
at timestep t are denoted as f_t, i_t, C_t, o_t, h_t, x_t ∈ R^{P×H×W}, respectively,
where P is the number of feature maps and H and W (i.e., height and width) stand for the spatial dimensions.
All the weight tensors, including W_f, W_i, W_c, W_o, W_cf, W_ci, W_co, are fixed for
each convolution kernel, which implies that the weights are shared as the kernel moves.
We can extract different spatial features (e.g., congestion and crowds) by using multiple
kernels. To ensure that the states have the same height and width as the inputs, we
pad the missing values with zeros when the kernel reaches the boundary, a technique called
zero-padding.
For a large-scale area, taxi demand depends not only on temporal features
but also on spatial characteristics; e.g., the central business district (CBD) of a city tends
to have more taxi demand. Conv-LSTM is an appropriate model for such
spatiotemporal datasets.
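As a sketch of Eq. (10), the forget gate of a Conv-LSTM can be computed by replacing the matrix products with 2-D convolutions. Splitting the single kernel applied to the concatenated [h_{t-1}, x_t] into separate input-to-state and state-to-state kernels is an equivalent, illustrative choice; all values below are toy data:

```python
# Sketch of the Conv-LSTM forget gate, Eq. (10): the matrix product of the
# LSTM is replaced by a 2-D "same" convolution so spatial structure is kept.
import numpy as np

def conv2d_same(x, kernel):
    """'Same' convolution with zero-padding, as described in the text."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H, W = 4, 4
x_t = np.random.default_rng(2).random((H, W))  # input "image" at step t
h_prev = np.zeros((H, W))                      # previous hidden state
C_prev = np.zeros((H, W))                      # previous cell state
Wxf = np.full((3, 3), 0.1)                     # input-to-state kernel
Whf = np.full((3, 3), 0.1)                     # state-to-state kernel
Wcf = np.full((H, W), 0.1)                     # peephole weights (Hadamard)
b_f = 0.0
# Eq. (10): f_t = sigma(W_f * [h_{t-1}, x_t] + W_cf (.) C_{t-1} + b_f)
f_t = sigmoid(conv2d_same(x_t, Wxf) + conv2d_same(h_prev, Whf)
              + Wcf * C_prev + b_f)
```

Because of the zero-padding, `f_t` keeps the same H × W spatial shape as the input, as required for the state tensors.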
4.5 Summary
In this chapter, we discussed four typical deep learning models that can be adopted in taxi
demand forecasting, including CNN, RNN, LSTM and Conv-LSTM. We presented the
basic version of each model and analyzed their pros and cons, as shown in Table 2.
Table 2 Comparison of basic deep learning models

Model      | Pros                                   | Cons
-----------|----------------------------------------|------------------------------------------
RNN        | Time series analysis                   | Costly computation time; not capable of
           |                                        | handling long-term dependency
CNN        | Spatial information extraction;        | Not widely used in time series analysis
           | efficient training process             |
LSTM       | Time series analysis; capable of       | Not capable of extracting spatial
           | handling long-term dependency          | information; costly computation time
Conv-LSTM  | Time series analysis; capable of       | Costly computation time
           | handling long-term dependency;         |
           | spatiotemporal information extraction  |
However, a single deep learning model cannot capture multiple features from
different sources. In such cases, we need to combine those models so that we can
incorporate multiple information sources and obtain accurate predictions. In the next chapter,
the proposed model is presented. It is an adaptive demand forecasting model that fuses
multiple models to build a better estimator.
Chapter 5
METHODOLOGY
As introduced in Chapter 4, each deep learning approach has both advantages and
limitations. In other words, each basic deep learning model suits a certain
situation; e.g., LSTM is more effective at dealing with time-dependent data, while CNN is
good at extracting spatial information.
Generally, taxi demand patterns can be complex, with various relationships between
demand and exogenous data. In such cases, it is hard to achieve optimal prediction
accuracy even if we select the most appropriate model from the candidates
above. In this chapter, we propose a novel deep learning structure which combines several
basic approaches into a stronger one. First, we use different models to extract different
information (i.e., demand, weather and popularity). Then, we fuse the encoded features using an
ensemble method. Finally, we train the model to learn the best combination of weights.
5.1 Preliminaries
The short-term taxi demand forecasting problem is inherently a time series prediction
problem, which implies that we can use taxi demand from previous time periods as valuable
information for future prediction. This thesis focuses on the prediction of taxi pick-ups in
the next 15 minutes. Conceptually, the number of drop offs in previous periods could
influence taxi pick-up demand, since people may potentially take a taxi to a place for a
particular activity (e.g., concert, party, work, etc.), and then return by using a taxi again. In
addition, weather conditions and location popularity could also impact the demand
generation. Here, the definitions and notations of the variables used in this thesis are
described:
5.1.1 Taxi zones and time partition
Taxi pick-up demand is both time-dependent and zonal-based. First, we need to define the
temporal and spatial characteristics, which are time interval and taxi zones, respectively.
The urban area is partitioned into small grids with an irregular shape, where each
grid represents a taxi zone. The driver can pick up passengers in one zone and drop them
off in another zone. Then, each taxi zone has both pick-up and drop-off records. For the
time interval, we aggregate variables in every 15 minutes, which implies the time interval
is 15 minutes. In this thesis, we focus on predicting pick-up demand in the next 15 minutes
using data from the previous hour. In other words, the inputs for pick-up and drop-off
requests will be the data in four intervals and output has one interval.
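The four-intervals-in, one-interval-out scheme corresponds to a standard sliding-window sample construction; a NumPy sketch with illustrative array shapes:

```python
# Sketch of the sliding-window sample construction: four 15-minute input
# intervals predict the next one, assuming `demand` is a (T, zones) array.
import numpy as np

def make_samples(demand: np.ndarray, k: int = 4):
    X = np.stack([demand[t - k:t] for t in range(k, len(demand))])
    y = demand[k:]
    return X, y   # X: (samples, k, zones), y: (samples, zones)

demand = np.arange(20).reshape(10, 2)   # 10 intervals, 2 zones (toy data)
X, y = make_samples(demand)
# X[0] holds intervals 0-3; y[0] is interval 4.
```

Each sample pairs one hour of history (`k = 4` intervals) with the demand of the following interval.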
5.1.2 Pick-up and drop-off demand
As discussed above, pick-up and drop-off demand have the same dimensions. So, we can
regard them as a two-dimension variable. The number of requests is aggregated in each
time interval (i.e. 15 minutes) and taxi zone.
The pick-up demand at the tth time slot (e.g., 15 min) in zone i is defined as the
number of pickups during this time interval within the zone, denoted by pt,i. The pick-up
demand for all the zones in each time interval is defined as matrix 𝑃", where the ith element
is (𝑃")i = pt,i. A similar definition for drop-offs where dt,i denotes the number of drop-off
records at the tth time slot (e.g., 15 min) in zone i. Again, the drop-off records for all the
zones in each time interval is kept in matrix 𝐷", where the ith element is (𝐷")i = dt,i.
5.1.3 Weather information
Weather is another input to our model. It contains several features which have an impact
on human mobility and activity behavior. For instance, extreme weather conditions
such as blizzards and dense fog directly influence the travel choices of both passengers
and drivers.
In this thesis, we consider 7 weather variables: maximum
temperature (in degrees Fahrenheit), minimum temperature (degrees Fahrenheit), precipitation
(millimeters), average wind speed (meters per second), snowfall (inches), smoke or haze
(dummy variable), and heavy fog or heavy freezing fog (dummy variable).
All of the aforementioned variables take hourly values (i.e., variables are
averaged per hour). We assume that weather variables only have temporal dependencies
(i.e., the variables take the same values across zones).
We denote these variables in different ways since we have both numerical and
categorical variables. The maximum temperature, minimum temperature, precipitation,
average wind speed and snowfall at time slot t are denoted by w_{t,tmax}, w_{t,tmin}, w_{t,prcp}, w_{t,awnd},
w_{t,snow}, respectively.
We introduce the dummy variable w_{t,skhz} to characterize the presence of smoke or haze,
given by:

w_{t,skhz} = 1, if smoke or haze occurs during the t-th time slot; 0, otherwise

We also denote by w_{t,hfog} another dummy variable for heavy fog:

w_{t,hfog} = 1, if heavy fog occurs during the t-th time slot; 0, otherwise
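Assembling the hourly weather vector can be sketched as follows; the condition strings passed in are hypothetical placeholders for the raw NOAA fields:

```python
# Sketch of assembling the hourly weather vector p_t of Section 5.1.3.
# The two dummies follow the definitions of w_{t,skhz} and w_{t,hfog} above.
def weather_vector(tmax, tmin, prcp, awnd, snow, conditions):
    """conditions: set of condition strings reported for the hour."""
    w_skhz = 1 if conditions & {"smoke", "haze"} else 0
    w_hfog = 1 if conditions & {"heavy fog", "heavy freezing fog"} else 0
    return [tmax, tmin, prcp, awnd, snow, w_skhz, w_hfog]

p_t = weather_vector(41.0, 28.0, 0.0, 3.2, 0.0, {"haze"})
```

The resulting 7-element vector is what the LSTM branch of Section 5.2.2 consumes at each timestep.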
5.1.4 Popularity
Popularity is a novel variable that we adopt in our demand forecasting problem. It reflects
the importance of a specific spot or region. In other words, it keeps spatial information of
different weights for different spots or regions. Popularity can be represented by various
features. In this thesis, POIs are used to measure location popularity.
A point of interest (POI) represents a specific point location that people may find
useful and interesting. So, POIs in a region inform the popularity of this region. We assume
that popularity is a zonal-based attribute. The popularity is defined as the number of
reviews of POI within the zone. The average review number, median review number,
minimum review number, maximum review number and standard deviation in zone i are
defined as the average value, median value, minimum value, maximum value and standard
deviation of location popularity, denoted by 𝑟<=;( , 𝑟>,?( , 𝑟>(@( , 𝑟><A( and 𝑟7"?( , respectively.
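The five zonal statistics can be computed with a pandas group-by; the toy review counts below are illustrative:

```python
# Sketch of computing the five zonal popularity statistics from POI review
# counts, assuming a DataFrame with hypothetical columns "zone" and "reviews".
import pandas as pd

pois = pd.DataFrame({
    "zone":    [1, 1, 1, 2, 2],
    "reviews": [10, 30, 50, 5, 15],
})
popularity = (pois.groupby("zone")["reviews"]
                  .agg(["mean", "median", "min", "max", "std"]))
# Each row of `popularity` is the (r_avg, r_med, r_min, r_max, r_std) of a zone.
```

Under the stationarity assumption above, this table is computed once and reused for every time interval.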
5.1.5 Problem formulation
The prediction problem is to predict the pick-up demand for the next time interval (15 min) in
each taxi zone using pick-up, drop-off, weather and location popularity information. It
can be formulated as:

P̂_{t+1} = f(P_s | s = t, t−1, …, t−m; D_s | s = t, t−1, …, t−m; R^z | z = 1, 2, …, 63; W_s | s = t, t−1, …, t−m)    (16)

where f(·) denotes the mapping learned by the model, P̂_{t+1} is the taxi pick-up demand
prediction for the next 15 minutes, m is the look-back time window, and P, D, R, W
represent pick-up demand, drop-off demand, popularity and weather, respectively.
5.2 Multi-source information based spatiotemporal neural network (MSI-STNN)
This section presents the proposed deep learning architecture, i.e., MSI-STNN, to integrate
spatiotemporal variables (i.e., pick-up and drop-off records) and exogenous information
(i.e., weather and popularity) for short-term taxi demand forecasting. Specifically, the
method is composed of three deep learning models: convolutional long short-term memory
(Conv-LSTM), long short-term memory (LSTM) and convolutional neural network (CNN),
which are utilized to capture different characteristics. We subsequently concatenate these
extracted features to obtain the prediction outcome.
We propose the multi-source information based spatiotemporal neural network
(MSI-STNN) model to integrate data from multiple sources into our deep learning
architecture. The structure of MSI-STNN is shown in Fig. 4. The stacked Conv-LSTM
layers are used to handle the spatiotemporal variables (i.e., pick-up and drop-off demand).
The LSTM layers are implemented to deal with the temporal variables (i.e., weather), while
the CNN layers are adopted to extract the popularity features in each zone. The encoded
information from the different sources is concatenated, and two fully connected layers (i.e.,
dense layer) are used as the decoder.
Figure 4 Framework of the proposed MSI-STNN model
5.2.1 Structure of spatial variables
The points of interest (POIs) in the study area constitute the spatial variable, since the
number of POI reviews reflects the popularity of a zone. A CNN architecture is applied to
extract information about location popularity. The formulation is given below:

r^i = (r^i_avg, r^i_med, r^i_min, r^i_max, r^i_std)    (17)

R = (R^1, …, R^i) = F_2^maxpool(F_2^conv(F_1^maxpool(F_1^conv(r^1, …, r^i))))    (18)

V̂^r = σ(w_R ∗ R + b_R)    (19)

where r^i_avg, r^i_med, r^i_min, r^i_max, r^i_std are the average, median, minimum, maximum and
standard deviation values of the popularity in zone i, respectively, F_l^conv and F_l^maxpool
denote the l-th convolution and max-pooling layers, and w_R, b_R are the weights and bias.
5.2.2 Structure of temporal variables
Weather conditions do not vary across zones, which implies that weather is a non-
spatial variable. But it changes as time goes by, which makes it a temporal variable.
Changes in weather conditions, especially extreme conditions (e.g.,
snowstorms and rainstorms), affect the pick-up demand. A sequence of vectors, defined
as p_t = (w_{t,tmax}, w_{t,tmin}, w_{t,prcp}, w_{t,awnd}, w_{t,snow}, w_{t,skhz}, w_{t,hfog}), is fed into the stacked LSTM
architecture and transformed into the encoded weather information V̂^w_t:

(P^{Lw}_{t−k}, …, P^{Lw}_{t−1}) = F_{Lw}^{LSTM}(… F_1^{LSTM}(p_{t−k}, …, p_{t−1}))    (20)

V̂^w_t = σ(w_e · P^{Lw}_{t−1} + b_e)    (21)

where k is the look-back time window, Lw is the last layer of the stacked LSTM
architecture, and P^{Lw}_{t−k}, …, P^{Lw}_{t−1} is the output of that last layer. w_e, b_e are the
weights and bias.
5.2.3 Structure of spatiotemporal variables
The spatiotemporal variables (i.e., pick-up and drop-off demand) share the same training
structure, but they are handled separately. Each branch is composed of stacked Conv-LSTM
layers, one batch normalization layer (BN) and one dropout layer (DO). The formulation
of the architecture for the spatiotemporal variables is given in Eqs. (22)-(27):

h_t = (p_{t,1}, …, p_{t,i})    (22)

d_t = (d_{t,1}, …, d_{t,i})    (23)

(H^{Lp}_{t−k}, …, H^{Lp}_{t−1}) = F_{Lp}^{DO}(F_{Lp}^{BN}(F_{Lp}^{ConvLSTM}(… F_1^{ConvLSTM}(h_{t−k}, …, h_{t−1}))))    (24)

V̂^p_t = σ(w_H · H^{Lp}_{t−1} + b_H)    (25)

(D^{Ld}_{t−k}, …, D^{Ld}_{t−1}) = F_{Ld}^{DO}(F_{Ld}^{BN}(F_{Ld}^{ConvLSTM}(… F_1^{ConvLSTM}(d_{t−k}, …, d_{t−1}))))    (26)

V̂^d_t = σ(w_D · D^{Ld}_{t−1} + b_D)    (27)

where (h_{t−k}, …, h_{t−1}) and (d_{t−k}, …, d_{t−1}) are the input vectors for pick-up and drop-off
demand, respectively, and V̂^p_t and V̂^d_t are the corresponding encoded output vectors. k is the
look-back time window, i is the number of zones, and Lp, Ld are the last Conv-LSTM layers
for pick-up and drop-off demand, respectively. w_H, w_D, b_H, b_D are the weight and bias
parameters, and σ is the sigmoid function defined in Eq. (8).
5.2.4 Information fusion
The encoded information from the different sources is concatenated and then decoded by
two dense layers. The predicted pick-up demand at time t is:

V̂_t = (V̂^p_t, V̂^d_t, V̂^w_t, V̂^r)    (28)

V̂_{t,ultimate} = F_2^dense(F_1^dense(V̂_t))    (29)

where V̂_t is the concatenated vector and V̂_{t,ultimate} is the ultimate prediction at time t.
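The fusion step of Eqs. (28)-(29) is a concatenation followed by two dense layers; a NumPy shape-level sketch with illustrative layer sizes (63 outputs, one per taxi zone):

```python
# A NumPy sketch of the fusion step, Eqs. (28)-(29): the four encoded
# vectors are concatenated and passed through two dense layers.
import numpy as np

def dense(x, W, b, act):
    return act(W @ x + b)

relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z

rng = np.random.default_rng(3)
v_p, v_d = rng.random(8), rng.random(8)   # encoded pick-up / drop-off vectors
v_w, v_r = rng.random(4), rng.random(4)   # encoded weather / popularity vectors
v_t = np.concatenate([v_p, v_d, v_w, v_r])        # Eq. (28), length 24
W1, b1 = rng.normal(size=(16, 24)), np.zeros(16)
W2, b2 = rng.normal(size=(63, 16)), np.zeros(63)  # one output per taxi zone
v_ultimate = dense(dense(v_t, W1, b1, relu), W2, b2, identity)  # Eq. (29)
```

The vector lengths here are placeholders; in the actual model they are determined by the encoder branches.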
5.2.5 Objective function
For the training process, the mean squared error (MSE) between the prediction and the actual
demand is used as the loss function, given by:

loss = (1/n) Σ_{t=1}^{n} (V̂_{t,ultimate} − V_{t,actual})²    (30)

where n is the number of timesteps. The error is reduced by learning the weights and
biases through back propagation. The training process is illustrated in
Algorithm 1:
Algorithm 1. MSI-STNN training
Input Pick-up demand observations {P1,…,Pn} in training set
Drop-off demand observations {D1,…,Dn} in training set
Weather observations {W1,…,Wn} in training set
Location popularity observations {R1,…,Rn} in training set
Input time step: k + 1
Input zone id: i
Output MSI-STNN with learnt parameters
Procedure MSI-STNN train
1: Initialize a null set: V ← ∅
2: for all available time intervals t (k + 1 ≤ t ≤ n) do
3: 𝔙^h_t = [h_{t−k}, …, h_{t−1}]
4: 𝔙^d_t = [d_{t−k}, …, d_{t−1}]
5: 𝔙^w_t = [p_{t−k}, …, p_{t−1}]
6: 𝔙^r = [r^1, …, r^i]
where p_t = (w_{t,tmax}, w_{t,tmin}, w_{t,prcp}, w_{t,awnd}, w_{t,snow}, w_{t,skhz}, w_{t,hfog}),
r^i = (r^i_avg, r^i_med, r^i_min, r^i_max, r^i_std), and 𝔙^h_t, 𝔙^d_t, 𝔙^w_t, 𝔙^r are the input sets of
the different categories of explanatory variables in one sample.
7: put the training sample (𝔙^h_t, 𝔙^d_t, 𝔙^w_t, 𝔙^r) into V
8: end for
9: Initialize all the weight and bias parameters
10: repeat
11: Randomly extract a batch of samples V_b from V
12: Update the parameters by minimizing the objective function shown in Eq. (30)
within V_b
13: until convergence criterion met
14: end procedure
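Algorithm 1's batch-sampling loop can be illustrated in miniature; here a plain linear model stands in for MSI-STNN purely for illustration, and the parameters are updated by gradient descent on the MSE loss of Eq. (30):

```python
# A minimal stand-in for Algorithm 1's training loop: samples are drawn in
# random batches and parameters are updated by gradient descent on the MSE
# loss of Eq. (30). A linear model replaces MSI-STNN for illustration only.
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((200, 5))                  # stand-in for the input sets 𝔙
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w                            # stand-in for observed pick-ups
w = np.zeros(5)                           # step 9: initialize parameters
for epoch in range(300):                  # step 10: repeat until convergence
    idx = rng.permutation(len(X))
    for batch in np.array_split(idx, 10):             # step 11: random batch V_b
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # step 12: d(MSE)/dw
        w -= 0.1 * grad
mse = np.mean((X @ w - y) ** 2)           # Eq. (30) on the full set
```

In the real model the gradient step is performed by the deep learning framework's optimizer rather than this hand-written update.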
Chapter 6
CASE STUDY
In the previous chapter, the preliminaries of the different variables and our model structure
were presented. But how do they perform on real-world problems? This chapter
introduces a case study using taxi data from Manhattan, NY, to test and analyze the
performance of our model. As one of the most congested and busiest regions in the world,
Manhattan has a remarkably rich history of taxi services. A dataset of one year of
Manhattan taxi trips is used for training, and a three-month dataset is used for testing. In
Section 6.3, we perform a sensitivity analysis (i.e., hyperparameter fine-tuning) to find the
optimal model structure. An analysis of the results is presented in
Section 6.4. We conclude the chapter with visualizations of the prediction outcomes from
both spatial and temporal views.
6.1 Study site and dataset
The study site is Manhattan, NY, which is initially partitioned into 69 taxi
zones, as shown in Fig. 5. The zones with IDs 103, 104, 105, 153, 194 and 202 are not
considered in this thesis, since the taxi demand there is mostly zero. We exclude them to
avoid sparsity problems in the input matrix: if those zones were included,
most cells would be zero, which strongly affects the weight updates during
training and can make the predictions inaccurate.
Figure 5 Taxi zones in Manhattan
The taxi pick-up and drop-off requests are extracted from the NYC Taxi & Limousine
Commission (TLC) for the period from January 1st, 2017 to March 31st, 2018.
TLC partitions the Manhattan area into the taxi zones
described above. The dataset contains about 200,000 yellow taxi requests
per day in total, and each record includes the pick-up time, drop-off time, pick-up location
ID, and drop-off location ID. Each time record is a timestamp, while each location ID
corresponds to a specific taxi zone (e.g., if the location ID is 1, the record belongs to taxi
zone 1).
Both the pick-up and drop-off datasets are partitioned into a training set comprising
requests between January 1st, 2017 and December 31st, 2017, and a testing set containing
the remaining observations from January 1st, 2018 to March 31st, 2018. Fig. 6 shows the daily
pick-up records in January 2018. It is clear that the pick-up demand on January 4th,
2018 is extremely low; the reason is that there was a snowstorm that day, which
had a great impact on transportation. Fig. 7 shows the spatial distribution of pick-up
demand on January 26th, 2018, between 6 p.m. and 7 p.m., from which we can see that
most of the pick-up requests are concentrated in the middle part of Manhattan. This
spatiotemporal characteristic is a great challenge for short-term taxi demand forecasting,
and it motivates us to introduce the popularity and weather datasets to obtain a better
prediction.
Figure 6 Daily pick-up demand in January 2018
Figure 7 Spatial distribution of pick-up demand on January 26th, 2018
between 6 p.m. and 7 p.m.
6.2 Performance metrics
In this section, we introduce two metrics to evaluate model performance: the mean
squared error (MSE) and Theil's inequality coefficient (TIC) [46]. MSE is
a standard measure in statistics, especially for regression problems. TIC was first
used in the business domain (e.g., economic forecasts) and was later introduced to engineering.
6.2.1 Mean squared error (MSE)
The mean squared error (MSE) measures how close the predictions are to the observations.
It takes the differences between predicted and observed values (the "errors")
and squares them. The squaring removes negative signs
and gives more weight to larger differences. It is formulated as:

MSE = (1/N) Σ_{i=1}^{N} (Y_i − Ŷ_i)²    (31)
6.2.2 Theil’s inequality coefficient (TIC)
Theil's inequality coefficient (TIC), also known as Theil's U, measures how
well a time series of predicted values matches a corresponding time series of observed
values. Theil's U is calculated as:

U = sqrt((1/n) Σ_{t=1}^{n} (Y^pred_t − Y^obs_t)²) / [ sqrt((1/n) Σ_{t=1}^{n} (Y^pred_t)²) + sqrt((1/n) Σ_{t=1}^{n} (Y^obs_t)²) ]    (32)

where 0 ≤ U ≤ 1, Y^pred_t is the prediction outcome, and Y^obs_t is the actual observation.
The model performs well if U is close to 0 (i.e., the error between prediction and observation
is small).
U can be decomposed into bias ($U^b$), variance ($U^v$) and covariance ($U^c$) components, given by:

$U^b = \dfrac{\left(\bar{Y}^{pred} - \bar{Y}^{obs}\right)^2}{\frac{1}{N}\sum_{t=1}^{N}\left(Y_t^{pred} - Y_t^{obs}\right)^2}$ (33)

$U^v = \dfrac{\left(S^{pred} - S^{obs}\right)^2}{\frac{1}{N}\sum_{t=1}^{N}\left(Y_t^{pred} - Y_t^{obs}\right)^2}$ (34)

$U^c = \dfrac{2\left(1-\rho\right)S^{pred}S^{obs}}{\frac{1}{N}\sum_{t=1}^{N}\left(Y_t^{pred} - Y_t^{obs}\right)^2}$ (35)

where $\bar{Y}^{pred}$, $\bar{Y}^{obs}$, $S^{pred}$, $S^{obs}$ are the means and standard deviations of the predicted and observed measurements, respectively, and $\rho$ is the correlation coefficient between the two sets of measurements. The three components satisfy the relationship:

$U^b + U^v + U^c = 1$ (36)
The bias proportion reflects systematic error, and the variance proportion indicates how well the model prediction replicates the variability of the actual data; both should be as small as possible (less than 0.2). The covariance proportion measures the remaining, unsystematic error and should therefore be close to one. Generally, if the bias and variance proportions are small, the covariance proportion will accordingly be close to one.
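For illustration, Theil's U and its three proportions can be computed with NumPy. The function and toy numbers below are assumptions for demonstration, not part of the thesis experiments; note the population standard deviation (ddof=0) is required for the proportions to sum exactly to one:

```python
import numpy as np

def theil_decomposition(y_pred, y_obs):
    """Theil's U (eq. 32) and its bias/variance/covariance proportions (eqs. 33-35)."""
    yp, yo = np.asarray(y_pred, float), np.asarray(y_obs, float)
    d = np.mean((yp - yo) ** 2)                       # mean squared error (denominator)
    u = np.sqrt(d) / (np.sqrt(np.mean(yp ** 2)) + np.sqrt(np.mean(yo ** 2)))
    sp, so = yp.std(), yo.std()                       # population std (ddof=0)
    rho = np.corrcoef(yp, yo)[0, 1]                   # correlation coefficient
    u_bias = (yp.mean() - yo.mean()) ** 2 / d         # systematic error
    u_var  = (sp - so) ** 2 / d                       # unmatched variability
    u_cov  = 2 * (1 - rho) * sp * so / d              # remaining error
    return u, u_bias, u_var, u_cov

# Toy predicted vs. observed demand series
u, ub, uv, uc = theil_decomposition([11, 10, 9, 14], [10, 12, 9, 15])
print(round(ub + uv + uc, 6))  # -> 1.0, i.e. the proportions sum to one
```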
6.3 Sensitivity analysis
With the performance metrics in place, we first conduct a sensitivity analysis aimed at fine-tuning the hyperparameters. The objective is to find the hyperparameter combination that yields the highest accuracy. We follow a sequential process, changing one hyperparameter at a time: once the optimal value of one hyperparameter is found, it is fixed and we move on to the next. In this section, we focus on the three hyperparameters that matter most in our model structure: the number of epochs, the optimizer, and the batch size.
6.3.1 Epoch
An epoch is one complete pass through the entire training dataset, i.e., one forward pass and one backward pass over all training samples. As the number of training epochs increases, the validation error of the model (MSE in this thesis) should decrease accordingly; this is the learning process of the model.
First, to investigate the impact of the number of training epochs on predictive performance, the number of epochs is increased from 0 to 100 and the validation error (MSE) after each training epoch is recorded for the proposed model. From Fig. 8 (initial training, without any modification of the model structure), we can see that the validation error of the MSI-STNN model decreases slowly over the first 60 epochs and then drops sharply to around 130. This low learning speed is expected, since the model structure is complicated (four parallel submodules) and the dataset is large (a one-year training set and a three-month testing set). The computational time is around 100 seconds per epoch. The MSE decreases only slightly between epoch 60 and epoch 100, from around 130 to 118, while the computation time increases by 4,000 seconds. Considering the trade-off between accuracy and computation time, the best number of training epochs for this model is around 60, which sacrifices a little predictive performance for better computational performance.
Figure 8 Initial training process with 100 epochs
Based on these results, we modified the structure of the model (as shown in Fig. 9), since the initial configuration took too long to converge. In the new model structure, we removed the batch normalization and dropout layers between the Conv-LSTM layers, which makes the model less complicated and faster to converge.
Figure 9 Framework of the modified MSI-STNN model
6.3.2 Optimizer
Optimizers are algorithms used to adjust the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss. The choice of optimizer thus determines how the weights and learning rates are updated, and it is responsible for both reducing the loss and producing the most accurate results possible.
There are various optimizers used in deep learning. In this thesis, we consider three: Stochastic Gradient Descent (SGD), Adagrad, and Adaptive Moment Estimation (Adam). The pros and cons of each optimizer are summarized in Table 2.

Table 2 Comparison of various optimizers
SGD. Pros: converges in less time; requires less memory. Cons: high variance; slow reduction of the learning rate.
Adagrad. Pros: automatically adapted learning rate; able to train on sparse data. Cons: expensive computation; the decreasing learning rate slows training.
Adam. Pros: converges rapidly; rectifies the vanishing learning rate and high variance. Cons: expensive computation.
SGD is a variant of Gradient Descent that updates the model's parameters more frequently: the parameters are adjusted after computing the loss on each training example. So, if the dataset contains 1,000 samples, SGD updates the model parameters 1,000 times per epoch instead of once, as in batch Gradient Descent. Because the parameters are updated so frequently, they exhibit high variance and fluctuations.
One disadvantage of SGD is that its learning rate is constant for all parameters and for every epoch. The learning rate of the Adagrad optimizer, in contrast, is adaptive: Adagrad changes the learning rate η for each parameter and at every time step. It is a first-order optimization algorithm that works with the derivative of the error function, making large updates for infrequently updated parameters and small steps for frequent ones.
Adam is an adaptive learning rate optimization algorithm designed specifically for training deep neural networks. It leverages the power of adaptive learning rate methods to find an individual learning rate for each parameter. It combines the advantages of Adagrad, which works well with sparse gradients but struggles in the non-convex optimization of neural networks, and RMSprop, which works well in online settings. Adam's popularity has grown rapidly in recent years.
Fig. 10 shows the training process using the different optimizers. Here we focus mainly on computation time, since our dataset is large and the primary objective is an efficient training process. The model converges much more rapidly with the Adam optimizer than with the other two. Moreover, both SGD and Adagrad exhibit two decays: they converge during the first decay, while the overfitting problem shows up in the second. Overall, Adam is a better optimizer for our model than SGD and Adagrad, and it is also the optimizer adopted in our initial model structure (as shown in Fig. 4).
Figure 10 Training process using different optimizers
6.3.3 Batch size
The batch size is a hyperparameter that defines the number of samples processed before the internal model parameters are updated. Think of a batch as a for-loop iterating over one or more samples and making predictions; at the end of the batch, the predictions are compared to the expected outputs and the error is computed. From this error, the update algorithm improves the model. A training dataset can be divided into one or more batches; the smaller the batch size, the more batches per epoch.
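The batching logic described above can be sketched as a simple generator (the array names and sizes below are illustrative; the real model consumes demand tensors):

```python
import numpy as np

def iterate_batches(X, y, batch_size):
    """Yield successive (X, y) mini-batches; the last batch may be smaller."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

X = np.arange(10).reshape(10, 1)   # 10 toy samples, one feature each
y = np.arange(10)
sizes = [len(xb) for xb, yb in iterate_batches(X, y, batch_size=4)]
print(sizes)  # -> [4, 4, 2]: three parameter updates per epoch
```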
In this thesis, we compare the training processes and performance (MSE and TIC values) for batch sizes of 32, 64, 128 and 256. As shown in Fig. 11, the model converges rapidly for all batch sizes (since we adopt Adam as the optimizer), but on the test set the batch size of 32 shows some volatility early on. This makes sense: a small batch size means more batches per epoch, the many noisier updates consume more computation time, and the direction of the loss decay becomes less certain. In other words, the training process is more stable with larger batch sizes.
Figure 11 Training using different batch sizes
Table 3 Performance comparison of various batch sizes

Batch size    MSE        TIC (bias)    TIC (variance)
32            119.823    0.40          0.43
64            109.762    0.21          0.38
128           114.904    0.49          0.63
256           137.296    0.09          0.14
From a performance point of view (Table 3), the batch size of 64 has the lowest MSE, noticeably smaller than that of any other batch size. For a batch size of 128, both the bias and variance components of the TIC decomposition are high (0.49 and 0.63), implying poor performance. Although the TIC values for a batch size of 256 are the smallest (0.09 and 0.14), this batch size has the largest MSE (137.296); moreover, on the test set the batch size of 256 also shows some volatility. So the batch size should not be too large: with large batch sizes, the direction of the loss decay is unstable, which can lead to a local optimum rather than the global one. Overall, 64 is the most suitable batch size for our model.
6.3.4 Summary
Through the sensitivity analysis, we tuned three hyperparameters: the number of epochs, the optimizer, and the batch size. First, we varied the number of epochs from 0 to 100 using the initial model structure. The validation error hardly decreased until about epoch 50 and then converged sharply around epoch 60, suggesting that the model structure was not suitable; we therefore modified the model before tuning the optimizer. We then compared three optimizers: SGD, Adagrad, and Adam. Adam was the fastest to converge, i.e., the most efficient. Lastly, we considered the impact of various batch sizes; using the MSE value and the TIC decompositions (bias and variance), we concluded that a batch size of 64 is the most suitable for our model.
6.4 Performance diagnostics
In this section, we present two comparisons to test the performance of the modified model structure and the tuned hyperparameters. The first compares the proposed model with several baseline models (ARIMA, RNN, and Conv-LSTM using only pick-up demand). The second evaluates performance using K-fold cross-validation. The models are evaluated with two measures of effectiveness: MSE and the TIC decompositions (bias and variance).
6.4.1 Comparison between baselines and MSI-STNN
In this thesis, we use ARIMA, RNN, and Conv-LSTM with only pick-up demand data as benchmark models. Comparing our model against these three standard approaches allows us to assess the value of our formulation and of incorporating multi-source information and spatiotemporal attributes.

Table 4 lists the predictive performance of the proposed model and the three baselines on the testing set. The proposed MSI-STNN outperforms the benchmarks on both measures of predictive performance, which validates the importance of multi-source information and spatiotemporal attributes in demand forecasting.
Table 4 Performance comparison between the proposed model and baselines

Model                  MSE        TIC (bias)    TIC (variance)
ARIMA                  137.230    0.35          0.43
RNN                    116.304    0.38          0.45
Conv-LSTM              111.357    0.18          0.23
MSI-STNN (modified)    108.320    0.12          0.16
6.4.2 Evaluation using K-fold cross validation
Next, we adopt K-fold cross-validation to evaluate model performance. As mentioned earlier, the training set covers January 1st, 2017 to December 31st, 2017 and the testing set covers January 1st, 2018 to March 31st, 2018. If we split the dataset by seasons, only the first season of 2018 is available as a testing set, so we face a seasonality problem that can strongly affect the accuracy of the prediction outcome. To eliminate this influence on the demand forecasting, we use K-fold cross-validation.
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. It has a single parameter, k, referring to the number of groups that the data sample is split into; hence the name k-fold cross-validation. When a specific value of k is chosen, it may be substituted for k in referring to the procedure, e.g., k = 10 becomes 10-fold cross-validation.

In this thesis, we set k = 4. Three folds serve as the training set and the remaining one as the validation set (as shown in Fig. 12). The model is trained four times with different training sets, so that each fold is used once for validation.
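The 4-fold split can be sketched as follows; the toy dataset size is an illustrative assumption, whereas the real folds are the seasonal partitions shown in Fig. 12:

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train, validation) index arrays: each fold validates once
    while the remaining k-1 folds form the training set."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

n = 12  # toy dataset size standing in for the time-indexed demand records
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(n, k=4)):
    print(f"fold {fold}: train={len(train_idx)} samples, val={len(val_idx)} samples")
# Across the 4 runs, every sample appears in exactly one validation fold.
```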
Figure 12 K-fold cross-validation (K=4; red block: training set; blue block: validation set)

Table 5 shows the predictive performance of the originally proposed model without K-fold cross-validation and of the model with K-fold cross-validation. Accuracy does increase with K-fold cross-validation, but not by much, while the computational cost is high, since the model must run four times on such a huge dataset. Overall, K-fold cross-validation is useful for validating a model, but not for improving its performance.
Table 5 Performance comparison between the proposed model and the model with K-fold cross-validation

Model                                               MSE        TIC (bias)    TIC (variance)
MSI-STNN (modified)                                 113.076    0.16          0.25
MSI-STNN (modified) with K-fold cross-validation    105.209    0.19          0.22
6.5 Data visualization
In this last section, we present both spatial and temporal visualizations comparing the ground-truth pick-up demand and the predicted results on the test dataset.
6.5.1 Temporal visualization
The temporal characteristics are visualized for both the predicted and the real demand (Fig. 13). To show the results at high resolution, we plot only the demand, averaged over all taxi zones, for the first 1,000 timesteps (x1, x2, x3, ..., x1000). Each timestep is 15 minutes, so the results cover 10 days and 10 hours. The red line represents the ground-truth pick-up demand, and the green line is the corresponding prediction.

The prediction results match the actual data very well: the two curves follow the same trend, and the error is quite small. Even in some abnormal cases (e.g., the demand between the 300th and 400th timesteps), our model predicts accurately.
Figure 13 Temporal prediction outcome for the first 1,000 timesteps
6.5.2 Spatial visualization
Next, we visualize the data geographically. Fig. 14 shows the spatial characteristics of the prediction versus the ground truth: a heatmap of the average error (prediction minus ground truth) over all timesteps for the whole Manhattan area, where a deeper color implies a larger error (deep red for positive errors and deep blue for negative errors).

The accuracy of the demand forecast varies substantially across zones. For instance, the forecasts for taxi zones in Upper Manhattan have small errors, but the errors can be large for some zones around Central Park. Midtown Manhattan has heavy traffic compared to Upper Manhattan, which makes it challenging to forecast taxi demand accurately in those zones. Overall, the errors are small.
Figure 14 Average error distribution across zones for all the timesteps
Chapter 7
CONCLUSION
7.1 Contributions
In this thesis, we propose a novel deep learning approach, based on the fusion of data from multiple sources through a spatiotemporal neural network (MSI-STNN), to predict taxi pick-up demand over the next 15 minutes. The model uses historical demand information together with information on points of interest (POIs) in the zone of interest and on weather conditions.

We use taxi data from Manhattan, New York, as the source of pick-up and drop-off demand. The popularity and weather data are mined from a Yelp dataset and from the National Oceanic and Atmospheric Administration (NOAA), respectively. We quantify popularity by the number of reviews at a point of interest (POI); to the best of our knowledge, this is the first time such data have been used in taxi demand forecasting.
Our model (i.e. MSI-STNN) consists of four submodules, using convolutional long
short-term memory (Conv-LSTM), long short-term memory (LSTM), and convolutional
neural network (CNN) models. Two Conv-LSTMs capture the spatiotemporal
characteristics of pick-up and drop-off demand simultaneously, while the CNN and LSTM
models extract spatial and temporal information about zonal popularity and weather
conditions.
We evaluate the MSI-STNN performance through a case study. The performance metrics include the mean squared error (MSE) and two decomposition terms of Theil's inequality coefficient (TIC): bias and variance. Through a sensitivity analysis, three hyperparameters were fine-tuned: the number of epochs, the optimizer, and the batch size. Based on the results, the initial model structure was modified. For the optimizer, we evaluated three alternatives, the Adam, Adagrad and SGD algorithms; Adam was selected since it was the fastest to converge (i.e., the most efficient). The most suitable batch size was 64, resulting in relatively small values of the MSE and of the TIC decompositions (bias and variance). The model performance was validated by comparing its output with state-of-the-art time series and deep learning approaches, including ARIMA, RNN, and Conv-LSTM. The proposed MSI-STNN outperforms the benchmark algorithms. The results highlight the importance of multi-source information in demand prediction.
7.2 Future research
Future work can focus on exploring even more advanced deep learning architectures to
fuse multi-source information as well as constraints among zonal pick-ups and drop-offs.
In our case study, we used pick-up and drop-off demand from Manhattan taxi data, which
is a perfect dataset without any missing values. In the future, we need to test the robustness
of the model using other datasets that may have missing values. We only consider three
types of hyperparameters. But there are some other hyperparameters that can be tuned (e.g.
loss function).
REFERENCES
[1] J.H. Cochrane. Time Series for Macroeconomics and Finance. 1997.
[2] K.W. Hipel and A.I. McLeod. Time Series Modelling of Water Resources and
Environmental Systems. 1994.
[3] C.F. Lee, J.C. Lee, and A.C. Lee. Statistics for Business and Financial Economics.
World Scientific Publishing Co. Pte. Ltd, 1999.
[4] P.J. Harrison and C.F. Stevens. Bayesian Forecasting. Journal of the Royal Statistical
Society. Series B (Methodological), Vol. 38, No. 3, 1976, pp. 205-247.
[5] M.S. Ahmed and A.R. Cook. Analysis of freeway traffic time-series data by using Box-Jenkins techniques, 1979.
[6] I. Okutani and Y.J. Stephanedes. Dynamic prediction of traffic volume through Kalman
filtering theory.
[7] J.M. Kihoro, R.O. Otieno and C. Wafula. Seasonal time series forecasting: a comparative study of ARIMA and ANN models. African Journal of Science and Technology (AJST), Science and Engineering Series, 2004.
[8] P.M. Yelland. Bayesian forecasting of parts demand. International Journal of
Forecasting, 2010, pp. 374-396.
[9] B.M. Williams and L.A. Hoel. Modeling and forecasting vehicular traffic flow as a
seasonal ARIMA process: theoretical basis and empirical results. J. Transp. Eng. Vol. 129,
No. 6, 2003, pp. 664–672.
[10] J. Guo, W. Huang and B.M. Williams. Adaptive Kalman filter approach for stochastic
short-term traffic flow rate prediction and uncertainty quantification. Transportation
Research Part C: Emerging Technologies. Vol. 43, No. 1, 2014, pp. 50–64.
[11] E. Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
[12] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning.
[13] W.S. McCulluoch and W.H. Pitts. A logical calculus of the ideas immanent in nervous
activity. The Bulletin of Mathematical Biophysics, Vol. 5, 1943, pp. 115-133.
[14] L. Breiman. Random Forest. Machine Learning, Vol. 45, No. 1, 2001, pp. 5-32.
[15] L. Zhang, Q. Liu, W. Yang, N. Wei and D. Dong. An Improved K-nearest Neighbor
Model for short-term Traffic Flow Prediction. Procedia-Social and Behavioral Sciences,
Vol. 9, No. 6, 2013, pp. 653-662.
[16] E.I. Vlahogianni, M.G. Karlaftis and J.C. Golias. Optimized and meta-optimized
neural networks for short-term traffic flow prediction: a genetic approach. Transportation
Research Part C, Vol. 13, No. 3, 2005, pp. 211–234.
[17] J. Kamruzzaman, R. Begg, and R. Sarker. Artificial Neural Networks in Finance and
Manufacturing. Idea Group Publishing, 2006.
[18] G. Zhang, B.E. Patuwo, and M.Y. Hu. Forecasting with artificial neural networks: The
state of the art. International Journal of Forecasting,1998.
[19] A. Azadeh and Z.S. Faiz. A meta-heuristic framework for forecasting household
electricity consumption. Applied Soft Computing, 2001.
[20] A.C. de Pina and G. Zaverucha. Combining attributes to improve the performance of
naive bayes for regression. IEEE World Congress on Computational Intelligence, 2008.
[21] A. Al-Smadi and D. M. Wilkes. On estimating arma model orders. IEEE International
Symposium on Circuits and Systems, 1996.
[22] X. Li, L. Ding, M. Shao, G. Xu, and J. Li. A novel air-conditioning load prediction
based on arima and bpnn model. Asia-Pacific Conference on Information Processing,
2009.
[23] S. Hochreiter and J. Schmidhuber. Long Short-term Memory. Neural Computation,
Vol. 9, No. 8, 1997, pp. 1735-1780.
[24] J. Chung, C. Gulcehre, K.H. Cho and Y. Bengio. Empirical Evaluation of Gated
Recurrent Neural Networks on Sequence Modeling, 2014.
[25] X. Ma, Z. Tao, Y. Wang, H. Yu and Y. Wang. Long Short-term Memory Neural
Network for Traffic Speed Prediction using Remote Microwave Sensor Data.
Transportation Research Part C: Emerging Technologies, Vol. 54, 2015, pp. 187-197.
[26] H. Yu, Z. Wu, S. Wang, Y. Wang and X. Ma. Spatiotemporal Recurrent Convolutional
Networks for Traffic Prediction in Transportation Networks. Sensors, Vol. 17, No. 7, 2017,
pp. 1501.
[27] N.G. Polson and V.O. Sokolov. Deep Learning for Short-term Traffic Flow Prediction.
Transportation Research Part C: Emerging Technologies, Vol. 79, 2017, pp. 1-17.
[28] X. Shi, Z. Chen, H. Wang and D.Y. Yeung. Convolutional LSTM Network: A
Machine Learning Approach for Precipitation Nowcasting, 2015.
[29] X. Ma, Z. Dai, Z. He, J. Ma, Y. Wang and Y. Wang. Learning Traffic as Images: A
Deep Convolutional Neural Network for Large-Scale Transportation Network Speed
Prediction. Sensors, Vol. 17, No. 4, 2017, pp. 818.
[30] J. Ke, H. Zheng, H. Yang and X. Chen. Short-term Forecasting of Passenger Demand
Under On-demand Ride Services: A Spatio-temporal Deep Learning Approach.
Transportation Research Part C, Vol. 85, 2017, pp. 591-608.
[31] J. Bao, P. Liu and S.V. Ukkusuri. A Spatiotemporal Deep Learning Approach for
Citywide Short-term Crash Risk Prediction with Multi-source Data. Accident Analysis and
Prevention, Vol. 122, 2019, pp. 239-254.
[32] B. Cule, B. Goethals, S. Tassenoy, and S. Verboven. Mining train delays. Advances
in Intelligent Data Analysis X, ser. LNCS vol. 7014, pages 113-124, 2011.
[33] D.M. Kline. Methods for multi-step time series forecasting with neural networks.
Information Science Publishing, pages 226-250, 2004.
[34] X. Li, G. Pan, Z. Wu, G. Qi, S. Li, D. Zhang, W. Zhang, and Z. Wang. Prediction of
urban human mobility using large-scale taxi traces and its applications. Frontiers of
Computer Science in China, pages 111-121, 2012.
[35] M.C. Gonzalez, C.A. Hidalgo, and A.-L. Barabasi. Understanding individual human
mobility patterns. Nature, pages 779-782, 2008.
[36] J. Gama and P. Rodrigues. Stream-based electricity load forecast. Knowledge
Discovery in Databases: PKDD, pages 446-453, 2007.
[37] B. Williams and L. Hoel. Modeling and forecasting vehicular traffic flow as a seasonal
arima process: Theoretical basis and empirical results. Journal of Transportation
Engineering, pages 664-672, 2003.
[38] K. Wong, S. Wong, M. Bell, and H. Yang. Modeling the bilateral micro-searching behavior for urban taxi services using the absorbing Markov chain approach. Journal of Advanced Transportation, Vol. 39, No. 1, pages 81-104, 2005.
[39] L. Moreira-Matias, J. Gama, M. Ferreira, and L. Damas. A predictive model for the
passenger demand on a taxi network. 15th International IEEE Conference on Intelligent
Transportation Systems (ITSC), pages 1014-1019, 2012.
[40] L. Liu, C. Andris, A. Biderman, and C. Ratti. Uncovering taxi drivers' mobility intelligence through their traces. IEEE Pervasive Computing, pages 1-17, 2009.
[41] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang. T-drive: driving
directions based on taxi trajectories. Proceedings of the 18th SIGSPATIAL International
Conference on Advances in Geographic Information Systems, ACM, pages 99-108, 2010.
[42] B. Li, D. Zhang, L. Sun, C. Chen, G. Qi, S. Li, and Q. Yang. Hunting or waiting? Discovering passenger-finding strategies from a large-scale real-world taxi dataset. 2011 IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOM Workshops), pages 63-68, 2011.
[43] L. Moreira-Matias, J. Gama, M. Ferreira, and L. Damas. A predictive model for the
passenger demand on a taxi network. 15th International IEEE Conference on Intelligent
Transportation Systems (ITSC), pages 1014-1019, 2012.
[44] J. Cryer and K. Chan. Time Series Analysis with Applications. R. Springer, 2008.
[45] A. Ihler, J. Hutchins, and P. Smyth. Adaptive event detection with time-varying
poisson processes. Proceedings of the 12th ACM SIGKDD international conference on
Knowledge discovery and data mining, ACM, pages 207-216, 2006.
[46] T. Toledo and H.N. Koutsopoulos. Statistical validation of traffic simulation models.
Transportation Research Record: Journal of the Transportation Research Board, pages
142-150, 2004.
[47] V.S. Bawa. Basic architecture of RNN and LSTM.
https://pydeeplearning.weebly.com/blog/basic-architecture-of-rnn-and-lstm
[48] C. Olah. Understanding LSTM networks. https://colah.github.io/posts/2015-08-
Understanding-LSTMs/