MULTI-SOURCE INFORMATION BASED SHORT-TERM TAXI PICK-UP DEMAND PREDICTION USING DEEP-LEARNING APPROACHES
A Thesis Presented
By
Ziyan Chen
to
The Department of Civil and Environmental Engineering
in partial fulfillment of the requirements
for the degree of
Master of Science
In the field of
Transportation
Northeastern University
Boston, Massachusetts
May 2020
ACKNOWLEDGMENTS
I would like to express my immense gratitude to my advisors and friends who contributed
to this thesis. First of all, I would like to thank my thesis supervisor, Prof. Haris N.
Koutsopoulos, for his excellent guidance and unwavering support all the time. His
insightful advice greatly helped me during the research and writing of this thesis.
I am indebted to my instructor, Dr. Zhenliang (Mike) Ma. Mike is always patient,
enthusiastic, and caring toward his students. Beyond his valuable guidance on my academic
work and research, his understanding and support gave me the confidence to overcome
difficulties during the research journey.
Thanks also to the many people who indirectly contributed to this thesis: the
Ph.D. students and visiting students of Prof. Haris N. Koutsopoulos, as well as Bo Wang,
a Ph.D. student at Monash University.
Finally, I want to give my deepest gratitude and appreciation to my family. I am
lucky to have my parents, Yu Chen and Huiyun Jiang, who brought me up and taught me
honesty, kindness, and responsibility. Without their support, I would never have had the
chance to come from my homeland, China, to study at Northeastern University.
ABSTRACT
Short-term demand prediction is of great importance to on-demand ride-hailing services.
Predicted demand information can facilitate efficient operations and improve service
performance. This thesis proposes a multi-source information based spatiotemporal neural
network (MSI-STNN) deep learning architecture to predict short-term taxi pick-up demand.
It fuses pick-up and drop-off time-series data, weather information, and location popularity
data using three deep-learning models: a stacked convolutional long short-term memory
(Conv-LSTM) model, a stacked long short-term memory (LSTM) model, and a
convolutional neural network (CNN) model. The Conv-LSTM captures the spatiotemporal
features of the pick-up and drop-off time series, the LSTM extracts weather information, and
the CNN incorporates popularity data. A case study is performed to predict short-term pick-up
demand at the zonal level 15 minutes ahead using taxi data from Manhattan, New York. The
results validate the superiority of the proposed approach compared with state-of-the-art
time-series and deep learning approaches, including ARIMA, LSTM, and Conv-LSTM.
Key words: taxi pick-up demand, short-term prediction, deep learning, multi-source
information
TABLE OF CONTENTS
ABSTRACT .......... iii
1. INTRODUCTION .......... 1
1.1 Background .......... 1
1.2 Problem statement .......... 2
1.3 Motivation .......... 2
1.4 Thesis outline .......... 3
2. LITERATURE REVIEW .......... 4
2.1 Statistical models .......... 4
2.2 Machine learning methods .......... 6
2.3 Deep learning approaches .......... 7
2.4 Predictive models for taxi demand .......... 8
2.5 Summary .......... 9
3. DATA SOURCES .......... 10
3.1 Pick-up and drop-off requests .......... 10
3.2 Exogenous data .......... 10
3.2.1 Weather .......... 11
3.2.2 Location popularity .......... 11
4. DEEP LEARNING MODELS .......... 13
4.1 Convolutional neural networks (CNN) .......... 13
4.2 Recurrent neural networks (RNN) .......... 15
4.3 Long-short term memory networks (LSTM) .......... 16
4.4 Convolutional long-short term memory networks (Conv-LSTM) .......... 18
4.5 Summary .......... 19
5. METHODOLOGY .......... 21
5.1 Preliminaries .......... 21
5.1.1 Taxi zones and time partition .......... 22
5.1.2 Pick-up and drop-off demand .......... 22
5.1.3 Weather information .......... 22
5.1.4 Popularity .......... 23
5.1.5 Problem formulation .......... 24
5.2 Multi-source information based spatiotemporal neural network (MSI-STNN) .......... 24
5.2.1 Structure of spatial variables .......... 25
5.2.2 Structure of temporal variables .......... 25
5.2.3 Structure of spatiotemporal variables .......... 26
5.2.4 Information fusion .......... 27
5.2.5 Objective function .......... 27
6. CASE STUDY .......... 29
6.1 Study site and dataset .......... 29
6.2 Performance metrics .......... 32
6.2.1 Mean squared error (MSE) .......... 32
6.2.2 Theil’s inequality coefficient (TIC) .......... 32
6.3 Sensitivity analysis .......... 33
6.3.1 Epoch .......... 33
6.3.2 Optimizer .......... 35
6.3.3 Batch size .......... 37
6.3.4 Summary .......... 39
6.4 Performance diagnostics .......... 39
6.4.1 Comparison between baselines and MSI-STNN .......... 39
6.4.2 Evaluation using K-fold cross validation .......... 40
6.5 Data visualization .......... 42
6.5.1 Temporal visualization .......... 42
6.5.2 Spatial visualization .......... 43
7. CONCLUSION .......... 44
7.1 Contributions .......... 44
7.2 Future research .......... 45
REFERENCES ................................................................................................................. 51
LIST OF TABLES
Table 1 Comparison of basic deep learning models .......... 20
Table 2 Comparison of various optimizers .......... 36
Table 3 Performance comparison of various batch sizes .......... 38
Table 4 Performance comparison between the proposed model and baselines .......... 40
Table 5 Performance comparison between the proposed model and the model with K-fold cross-validation .......... 41
LIST OF FIGURES
Figure 1 Illustration of a typical CNN architecture .......... 14
Figure 2 Illustration of a typical RNN architecture (adopted from [47]) .......... 15
Figure 3 Illustration of a standard LSTM cell (adopted from [48]) .......... 17
Figure 4 Framework of the proposed MSI-STNN model .......... 25
Figure 5 Taxi zones in Manhattan .......... 30
Figure 6 Daily pick-up demand in January 2018 .......... 31
Figure 7 Spatial distribution of pick-up demand on January 26th, 2018 between 6 p.m. and 7 p.m. .......... 31
Figure 8 Initial training process with 100 epochs .......... 34
Figure 9 Framework of the modified MSI-STNN model .......... 35
Figure 10 Training process using different optimizers .......... 37
Figure 11 Training using different batch sizes .......... 38
Figure 12 K-fold cross-validation (K=4; red block: training set; blue block: validation set) .......... 41
Figure 13 Temporal prediction outcome for the first 1,000 timesteps .......... 42
Figure 14 Average error distribution across zones for all the timesteps .......... 43
Chapter 1
INTRODUCTION
1.1 Background
Taxi service is of great importance in urban transportation for its flexibility and
convenience. Traveling from one place to another is a necessary part of daily life, and
different modes of public transportation (e.g., trains, buses, taxis) make it easier to move
between areas. The taxi is an end-to-end travel mode. It is not the first choice for most
commuters, though, because of its cost, possible waiting time, and the chance of
encountering congestion. Almost every taxi company is interested in
exploring solutions to optimize their operations and balance demand and supply.
The new era of transportation is multidisciplinary and provides opportunities to
combine transportation with other fields such as Geographic Information System (GIS).
GIS significantly aids in planning, monitoring and managing complex systems involved in
transportation planning and management more efficiently. The use of GIS in transportation
is widespread: traffic modeling, accident analysis, route planning and highway
maintenance. Thanks to GPS technologies, location data (e.g., for taxis) can be
captured. For instance, a taxicab can be tracked with multiple attributes, including its real-
time location, unique ID, trip distance, and even the fare amount.
In recent years, fast-growing smartphone apps have been changing the way people
travel in urban areas. Instead of hailing a taxi on the street, people can make a
reservation on their smartphones. Once the driver receives a request from a passenger, he/she can
receive the specific location of this passenger and choose the optimal route. This interaction
between drivers and passengers reduces waiting time and fuel costs, which benefits both
sides.
With the availability of huge amounts of taxi data, the opportunity exists to develop
methods to predict, in real time, future requests, a capability that can enhance the ability to
schedule service in more efficient and sustainable ways.
1.2 Problem statement
Predictive data analytics facilitates proactive decision support for both operators and
travelers in transportation. The demand forecasting problem has been widely studied in the
last decades, and a vast amount of methods exist, from model-based time series analysis to
machine learning and model-free deep learning techniques.
Traditional approaches focus on applying complex mathematical models to predict
travel demand. However, they are often less efficient and less accurate than state-of-the-art
machine learning and deep learning models. Moreover, traditional time series analysis is not capable
of dealing with the large-scale datasets that are becoming available.
Taxi demand requests in urban areas vary by zone, population density, weather
conditions, and so on, which makes forecasting complicated. This thesis aims
to predict real-time zonal taxi demand (e.g., 15 minutes ahead) by fusing
multi-source information.
1.3 Motivation
Machine learning and deep learning techniques are popular in the latest research. In this
thesis, the goal is to develop a multi-source information based spatiotemporal neural
network (MSI-STNN) approach to fuse pick-up, drop-off and exogenous variables
simultaneously. To be specific, pick-up and drop-off data can be mined from available taxi
datasets and combined with exogenous variables, such as weather and area popularity data.
After preparing the data of pick-up demand, drop-off demand, weather conditions and
popularity, a stacked convolutional long short-term memory (Conv-LSTM) model is
applied to extract the spatiotemporal features of pick-up and drop-off demand
simultaneously rather than combining long short-term memory (LSTM) and convolutional
neural network (CNN) to acquire spatial and temporal characteristics separately. The
stacked LSTM is adopted to predict weather conditions, and CNN is used to extract zonal
popularity features among zones.
The objective of this novel model structure is to forecast the taxi pick-up demand
in the next 15 minutes across all taxi zones, given the multi-source data from the previous hour.
Our model was trained and tested with taxi data in Manhattan from 2017 to 2018.
Specifically, the training set was from the whole year of 2017, while the testing set was the
first three months of 2018. Hyperparameters of the model were fine-tuned to improve
accuracy. The proposed model results were compared with baselines under different
conditions.
1.4 Thesis outline
This thesis is organized as follows:
Chapter 2 reviews statistical models and their application to the taxi demand
prediction problem. Machine learning methods are introduced in section 2.2, and state-of-
the-art deep learning approaches are reviewed and compared in section 2.3. Moreover, previous
predictive models for taxi demand are described in section 2.4.
Chapter 3 introduces four data sources with different information. The taxi pick-up
and drop-off requests are described in section 3.1. Exogenous data, used to improve the
performance of our model, are discussed in section 3.2. Multiple deep learning approaches
are used to extract information based on the characteristics of different data sources.
In chapter 4, deep learning approaches for demand forecasting are presented. There
are many variants of deep learning models. In this chapter, we focus on the basic ones and
specifically their use to the demand forecasting problem. Convolutional neural networks
(CNN), recurrent neural networks (RNN), long-short term memory networks (LSTM) and
convolutional long-short term memory networks (Conv-LSTM) are discussed.
In chapter 5, a novel deep learning model is proposed. We first pre-process the raw
datasets to prepare the inputs of our model. Then, we build a four-layer deep learning model
in section 5.2, including two Conv-LSTM models for pick-up and drop-off requests, an
LSTM model for weather data, and a CNN model for location popularity.
Chapter 6 introduces a case study in which the model is evaluated with the taxi
pick-up requests in Manhattan. The chapter presents the study area and dataset in section
6.1, performance metrics in section 6.2, sensitivity analysis in section 6.3, performance
diagnostics in section 6.4 and data visualization of pick-up requests in section 6.5.
Chapter 7 concludes the thesis.
Chapter 2
LITERATURE REVIEW
This chapter reviews the evolution of time series modeling. In section 2.1, we first discuss
the basics of statistical models, since time series analysis is essentially a statistical
technique. Over the past decades, machine learning methods came to be considered
fundamental methodologies for time series forecasting, as introduced in
section 2.2. Recently, state-of-the-art deep learning approaches have become mainstream in
data-driven analysis. In section 2.3, some basic deep learning models for time series
analysis are reviewed.
2.1 Statistical models
Statistical modeling is a simplified, mathematically formalized way to approximate reality
(i.e. what generates your data) and optionally to make predictions from this approximation.
Its application initially started in physics and is now being applied in finance, engineering,
social sciences, etc.
Time series is one of the techniques under statistical analysis, which is typically
measured over successive times, representing a sequence of data points [1]. The
measurements taken during an event in a time series are arranged in proper chronological
order. Basically, there are two types of time series: continuous and discrete. In a
continuous-time series, observations are measured at every instance of time, whereas a
discrete-time series contains observations measured at discrete points of time. Usually, in
a discrete-time series, the consecutive observations are recorded at equally spaced time
intervals such as hourly, daily, monthly, or yearly time separations. In general, to do further
analysis, the data observed in a discrete-time series is assumed to be a continuous
variable on the real number scale [2].
The procedure of fitting a time series to a proper model is referred to as time
series analysis. In practice,
the parameters of the model are estimated from the known data values, yielding models
that attempt to analyze and understand the nature of the series. These models are
useful for simulation and forecasting after being validated. A time series, in general,
is assumed to be affected by four main components: trend, cyclical, seasonal and irregular
components [3]. These four components can be extracted and separated from the observed
data. Considering the effects of these four components, in general, additive and
multiplicative models are used for a time series decomposition. An additive model is based
on the assumption that the four components of a time series are independent of each other.
However, in a multiplicative model, the four components can affect the others, meaning
they are not necessarily independent.
Traditional time series analysis leverages statistical models, which predict
future values given successive historical records. For example, Bayesian
forecasting [4], the autoregressive integrated moving average (ARIMA) model [5], and
the Kalman filter [6] are among the most classic ones.
A Bayesian Forecasting model is essentially a dynamic linear model. The Bayesian
approach, in general, requires the explicit formulation of a model, and conditioning on
known quantities, in order to draw inferences about unknown ones. In Bayesian forecasting,
one simply takes a subset of the unknown quantities to be future values of some variables
of interest.
The autoregressive integrated moving average (ARIMA) model is a generalization
of the simpler autoregressive moving average (ARMA) [7]. It is a form
of regression analysis that gauges the strength of one dependent variable relative to other
changing variables. The model's goal is to predict future values by examining the
differences between values in the series rather than the actual values. Both
seasonal and non-seasonal ARIMA models can be used for forecasting.
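To make the differencing ("I") and autoregressive ("AR") components concrete, here is a minimal NumPy sketch of an ARIMA(p, 1, 0)-style forecaster fitted by ordinary least squares. The synthetic series and parameter choices are illustrative only, not from the thesis; production work would typically use a library such as statsmodels.

```python
import numpy as np

def fit_ar(y, p):
    """Fit an AR(p) model y_t = c + sum_i phi_i * y_{t-i} by least squares."""
    X = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef  # [c, phi_1, ..., phi_p]

def forecast_arima_p10(series, p, steps):
    """ARIMA(p,1,0)-style forecast: difference once, fit AR(p), integrate back."""
    diff = np.diff(series)
    coef = fit_ar(diff, p)
    hist = list(diff)
    level = series[-1]
    out = []
    for _ in range(steps):
        lags = hist[-p:][::-1]                # most recent lag first
        d = coef[0] + np.dot(coef[1:], lags)  # predicted next difference
        hist.append(d)
        level += d                            # undo the differencing
        out.append(level)
    return np.array(out)

# Synthetic demand-like series with an upward drift plus noise
rng = np.random.default_rng(0)
y = np.cumsum(1.0 + 0.1 * rng.standard_normal(200))
preds = forecast_arima_p10(y, p=3, steps=4)   # e.g. next four 15-min intervals
```

Differencing removes the drift so the AR part works on a near-stationary series; the forecast then re-accumulates the predicted differences onto the last observed level.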
In statistics and control theory, Kalman filtering, also known as linear quadratic
estimation (LQE), is an algorithm that uses a series of measurements observed over time,
containing statistical noise and other inaccuracies, and produces estimates of unknown
variables that tend to be more accurate than those based on a single measurement alone, by
estimating a joint probability distribution over the variables for each timeframe.
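A minimal scalar sketch of the Kalman predict/update cycle, assuming a random-walk state observed with noise (all variances here are illustrative values, not from the thesis):

```python
import numpy as np

def kalman_1d(measurements, q=1e-3, r=0.25):
    """Scalar Kalman filter for a random-walk state x_t = x_{t-1} + w_t,
    observed as z_t = x_t + v_t, with process variance q and noise variance r."""
    x, p = measurements[0], 1.0   # initial state estimate and its variance
    estimates = []
    for z in measurements:
        # Predict: the random-walk model keeps x, but uncertainty grows by q
        p = p + q
        # Update: blend the prediction with the measurement via the Kalman gain
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return np.array(estimates)

rng = np.random.default_rng(1)
truth = np.cumsum(0.05 * rng.standard_normal(100)) + 10.0  # slowly drifting state
noisy = truth + 0.5 * rng.standard_normal(100)             # noisy observations
smooth = kalman_1d(noisy)
```

Because the estimate combines all past measurements, its error is smaller than that of any single noisy observation, which is exactly the property described above.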
These algorithms have been used in various applications, such as in supply chains
[8] and traffic flow prediction [9-10]. However, these algorithms rely on a specific
mathematical model of the underlying process, which is assumed to be linear with
Gaussian noise. This limitation is inconsistent with the complicated characteristics of
data from various sources. Typically, these statistical models have proved efficient
for simple stationary time series problems, which do not take multi-source data
into account.
2.2 Machine learning methods
Machine learning methods have been widely applied to demand forecasting in various domains
over the last decade. Examples include support vector machines, decision
trees, and others [11]. Some of these methods are based on classic statistical approaches
[12].
Methods such as artificial neural networks [13] and random forests [14] have made
great contributions to the analysis of time series in transportation. For example, an
improved K-nearest neighbors method [15] and optimized artificial neural networks (ANN) [16]
have been adopted in traffic flow prediction, with accuracies of roughly 90 percent.
Artificial neural networks (ANN) approaches have been suggested for time series
forecasting and gained popularity in the last few years. ANNs were built on a model of the
human brain [17]. Although the development of ANNs was mainly biologically motivated,
they have been applied in various areas, especially for classification and forecasting
purposes [7]. ANNs try to recognize essential patterns and regularities in the input data,
learn from experience and then provide generalized results based on the previous
knowledge. Among ANN models, the most widely used in forecasting problems is the multi-
layer perceptron, which uses a single-hidden-layer feed-forward network [18]. The model
is characterized by a network of three layers connected by acyclic links: the input, hidden, and
output layers.
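The three-layer structure can be sketched as a NumPy forward pass (the weights here are random for illustration only; a real model would learn them by backpropagation):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Single-hidden-layer feed-forward network: input -> hidden (tanh) -> output."""
    h = np.tanh(x @ W1 + b1)   # hidden layer activations
    return h @ W2 + b2         # linear output layer (regression)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 1   # e.g. 4 lagged demand values -> 1 prediction
W1 = rng.standard_normal((n_in, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_out)) * 0.1
b2 = np.zeros(n_out)

batch = rng.standard_normal((5, n_in))   # five input windows
y_hat = mlp_forward(batch, W1, b1, W2, b2)
```

The acyclic (feed-forward) structure is visible in the code: information flows strictly from input to hidden to output, with no recurrent connections.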
A major breakthrough in the area of time series forecasting occurred with the
development of support vector machines (SVM) [19-20]. The initial aim of SVM was to
solve pattern classification problems, but SVMs have since been applied in many other fields,
such as function estimation, regression, and time-series prediction [21]. A
characteristic of SVM is that it aims at a better generalization of the training data. With
SVMs, instead of depending on the whole data set, the solution usually only depends on a
subset of the training data points, called the support vectors [22]. Furthermore, with the
help of support vector kernels, the input points in SVM applications are usually mapped to
a high dimensional feature space, which often generates good generalization outcomes. For
this reason, the SVM methodology has become a technique used for time series forecasting
problems.
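As a sketch of SVM-based time-series forecasting, the series can be windowed into (lagged values, next value) pairs and fed to a support vector regressor. This example assumes scikit-learn's SVR is available; the synthetic series, window length, and kernel settings are illustrative choices, not from the thesis.

```python
import numpy as np
from sklearn.svm import SVR

def make_windows(series, window):
    """Turn a 1-D series into (lagged-window, next-value) training pairs."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Synthetic daily-cycle demand signal with noise
rng = np.random.default_rng(0)
t = np.arange(300)
series = np.sin(2 * np.pi * t / 24) + 0.05 * rng.standard_normal(300)

X, y = make_windows(series, window=8)
# RBF kernel maps the lag windows into a high-dimensional feature space
model = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X[:250], y[:250])
preds = model.predict(X[250:])   # one-step-ahead predictions on held-out windows
```

Only the support vectors (a subset of the 250 training windows) determine the fitted function, which is the sparsity property described above.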
The outcome of machine learning algorithms depends on the input, i.e., feature
selection is necessary. Furthermore, hyperparameters should be manually calibrated to
yield the best prediction. Another shortcoming of machine learning models is their
computational requirements.
2.3 Deep learning approaches
Deep learning is a subfield of machine learning concerned with algorithms based on
artificial neural networks (ANN). Essentially, deep learning
models are extensions of ANNs with multiple hidden layers.
Recently, advances in deep learning algorithms and their implementations have been
applied in transportation, especially in demand forecasting. Studies have shown the
advantages of recurrent neural network (RNN) and its variants, i.e., long short-term
memory (LSTM) [23] and gated recurrent units (GRU) [24], in time series analysis. [25]
employed an LSTM network to capture temporal
characteristics with optimal time lags for traffic speed prediction. [26] applied LSTM to
describe temporal relations for traffic state prediction, and it also added an autoencoder to
deal with extreme situations, e.g., peak hour and traffic accidents. [27] considered sharp
nonlinearities, e.g., transitions, breakdown, recovery and congestion, as major effects in
the performance of traffic flow prediction and, based on this, proposed a combined
ℓ1 regularization and a sequence of tanh layers to capture the nonlinearities. However, the
models mentioned above fail to consider spatial characteristics, which are an endogenous
dependency in zone-based demand forecasting.
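The gating mechanism that lets an LSTM retain long-range temporal information can be sketched as a single NumPy cell step (weights are random for illustration; a trained network would learn W, U, and b):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,); gates stacked as [i, f, o, g]."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])          # input gate: how much new info to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much cell state to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 3, 5                       # input size (e.g. demand, temp, rain), hidden size
W = rng.standard_normal((4 * H, D)) * 0.1
U = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((10, D)):   # run the cell over a 10-step sequence
    h, c = lstm_step(x, h, c, W, U, b)
```

The additive cell update `c_new = f * c + i * g` is what mitigates vanishing gradients and allows long time lags, but note that nothing here is spatial: the state is a flat vector, which is exactly the limitation noted above.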
CNNs have shown great potential in computer vision, and various papers employ CNNs for
traffic prediction. Typically, the area is treated as an image, and local features are extracted
using a CNN. [28] proposed a CNN-based architecture to predict traffic speed. In that paper,
the authors treat traffic as images: a CNN is applied to extract the features of vehicle trajectories
in each road segment, from which vehicle speed can be estimated.
These methods do not pay much attention to temporal correlations since they
simply fuse the data extracted by CNN. Later, researchers found a suitable solution to
combine spatial and temporal characteristics together, i.e., convolutional long short-term
memory models (Conv-LSTM), first introduced in [29]. For example, [30] employed
Conv-LSTM to handle travel time and demand intensity. [31] applied Conv-LSTM to
extract the spatiotemporal features of crash risk and taxi trips. The experimental outcomes
indicate that Conv-LSTM is more reliable and efficient when dealing with data that
has both temporal dependencies and spatial discrepancies. Nevertheless, this method has not
yet been widely adopted in transportation applications.
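The key idea can be sketched in NumPy for a single-channel grid: the four LSTM gate pre-activations are computed with 2-D convolutions over the zone grid instead of dense matrix products (kernels are random for illustration; a trained Conv-LSTM would learn them):

```python
import numpy as np

def conv2d_same(grid, kernel):
    """'Same'-padded 2-D sliding-window product of a (H, W) grid with a (k, k)
    kernel (cross-correlation, as in deep-learning 'convolutions')."""
    k = kernel.shape[0]
    pad = k // 2
    g = np.pad(grid, pad)
    H, W = grid.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(g[i:i + k, j:j + k] * kernel)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, Wx, Wh, b):
    """One Conv-LSTM step on a single-channel grid: the four LSTM gate
    pre-activations use convolutions instead of matrix multiplications."""
    gates = [conv2d_same(x, Wx[g]) + conv2d_same(h, Wh[g]) + b[g] for g in range(4)]
    i, f, o = sigmoid(gates[0]), sigmoid(gates[1]), sigmoid(gates[2])
    g = np.tanh(gates[3])
    c_new = f * c + i * g          # cell state is itself a spatial grid
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
Wx = rng.standard_normal((4, 3, 3)) * 0.1   # input-to-state 3x3 kernels
Wh = rng.standard_normal((4, 3, 3)) * 0.1   # state-to-state 3x3 kernels
b = np.zeros(4)

h = np.zeros((6, 6))                        # hidden state on a 6x6 zone grid
c = np.zeros((6, 6))
for x in rng.standard_normal((4, 6, 6)):    # four time steps of demand "images"
    h, c = convlstm_step(x, h, c, Wx, Wh, b)
```

Because the hidden and cell states are grids and the transitions are convolutions, each zone's state depends on its spatial neighbors at every time step, which is why Conv-LSTM captures spatial and temporal structure jointly rather than separately.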
2.4 Predictive models for taxi demand
In the last decade, GPS-location systems have attracted the attention of both researchers
and companies due to the new type of information they provide. Specifically, location-
aware sensors and information transmitted can be used to track human mobility patterns.
Rail [32], bus [33] and taxi [34] applications are already successfully exploring these traces.
Gonzalez et al. [35] uncovered the spatiotemporal regularity of human mobility. Similar
patterns can be found in other activities such as electricity load [36] and freeway traffic
[37].
In recent years, Uber and Lyft have become two popular ride-hailing companies that use
location-based services to reduce waiting times for both passengers and drivers. For
these services, the imbalance between demand and supply can reduce profits and level of
service. Wong presented a relevant mathematical model to express this need for
equilibrium in distinct contexts [38]. Lack of equilibrium may result in one of two scenarios:
(1) an excess of vacant vehicles and excessive competition; or (2) longer waiting times for
passengers and lower service reliability [39].
Knowledge about where the demand will actually emerge can be an advantage for
the drivers. Historical GPS data is central to this topic because it can
reveal underlying mobility patterns. Such data represents a new opportunity to learn
relevant patterns while the network is operating in real time.
Several researchers have already explored this data successfully with distinct
applications like modeling the spatiotemporal structure of taxi services [40], smart driving
[41] and building intelligent passenger-finding strategies [42]. Despite their useful insights,
the majority of the reported techniques are based on offline tests, discarding some of the main
advantages of real-time information. In other words, they do not provide online
information about expected future demand or the best places to pick up passengers in real time. One
of the recent advances on this topic was presented by Moreira-Matias et al. [43], where a discrete-
time series framework is proposed to forecast service demand. This framework handles
three distinct types of memory range: short term, midterm, and long term [44-45].
2.5 Summary
In this chapter, we have reviewed different approaches in time series modeling and demand
forecasting, including statistical, machine learning, and state-of-the-art deep learning models.
Traditional statistical models have been widely used in various domains. Machine learning
methods are considered as a more accurate and reliable way for demand forecasting, in
which artificial neural networks (ANN) and support vector machines (SVM) are among
the most popular ones. In the era of big data, however, neither statistical nor classic machine
learning models scale well to large datasets. With the help of GPU acceleration, deep learning
approaches have become popular for handling complicated time series problems. By applying
these forecasting techniques to predict taxi demand, operating costs, passenger waiting
times, and the number of idle taxis can be reduced. However, most studies have used only
time-series demand data for prediction without considering important exogenous variables,
such as weather or location popularity, etc. It is not trivial to fuse multi-source information
to make a real-time prediction using deep learning techniques.
Chapter 3
DATA SOURCES
In this chapter, different data sources for our model are introduced. In this thesis, we focus
on predicting the pick-up demand in the next 15 minutes given time-series information from the
previous hour. Many features could be considered to build a good model.
Guided by intuition and knowledge of transportation demand modeling, we leverage
several factors that are very likely to affect taxi demand: taxi pick-up requests, taxi
drop-off requests, weather, and popularity.
3.1 Pick-up and drop-off requests
With the help of GPS-location services, companies such as Uber and Lyft have
introduced new mobility services through which customers can request a taxi on a smartphone
instead of hailing one on the street. Once a passenger successfully finds a taxi, the trip
information, including time and location, is recorded. These real-time taxi trip records are
the key to building a time series demand forecasting model.
The area of interest can be divided into zones. We first aggregate the dataset by
time intervals (i.e., 15 minutes); the demand in a zone in each time interval
is then obtained by summing up the requests from that zone during that interval. The
dataset is then divided into pick-up and drop-off datasets.
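The aggregation step above can be sketched with pandas; the column names (`pickup_datetime`, `pickup_zone`) are hypothetical placeholders for the trip-record fields:

```python
# Sketch of the aggregation step, assuming a pandas DataFrame of trip records
# with hypothetical columns "pickup_datetime" and "pickup_zone".
import pandas as pd

def aggregate_demand(trips: pd.DataFrame, freq: str = "15min") -> pd.DataFrame:
    """Count pick-up requests per (time interval, zone)."""
    trips = trips.copy()
    trips["interval"] = trips["pickup_datetime"].dt.floor(freq)
    demand = (trips.groupby(["interval", "pickup_zone"])
                   .size()
                   .unstack(fill_value=0))  # rows: intervals, columns: zones
    return demand

trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2017-01-01 08:03", "2017-01-01 08:14", "2017-01-01 08:20",
        "2017-01-01 08:29", "2017-01-01 08:31"]),
    "pickup_zone": [1, 1, 2, 1, 2],
})
demand = aggregate_demand(trips)
# The 08:00-08:15 interval has 2 requests in zone 1.
```

The same function applied to the drop-off columns yields the drop-off dataset.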
3.2 Exogenous data
Traditional time series analysis pays much attention to temporal characteristics. But as
is shown in Figure 4, both temporal and spatial characteristics should be
taken into account in demand forecasting. In this thesis, pick-up and drop-off requests
contain both temporal (i.e., travel time) and spatial (i.e., location ID) characteristics. In order
to get a more accurate prediction, we consider exogenous data (i.e., popularity and
weather) that may impact demand.
3.2.1 Weather
Weather data can be collected from various sources. For example, in the USA, the National
Oceanic and Atmospheric Administration (NOAA) website provides hourly aggregated
weather information. The information includes hourly maximum temperature, minimum
temperature, precipitation, average wind speed and snowfall.
Temperature is the most straightforward factor. It has strong temporal
characteristics: the maximum and minimum temperatures during a day generally occur
in the afternoon and around midnight, respectively. Moreover, temperature is seasonal;
typically, the average temperature in winter is much lower than in the
summertime.
Precipitation is another aspect of weather. Wind and snow are two factors
that affect people's activities in multiple ways. For instance, people may prefer public
transportation during a snowstorm, or stay at home when there are
intense winds.
Overall, weather information is of great importance in demand forecasting,
especially in some extreme cases (e.g. snowstorm and rainstorm).
3.2.2 Location popularity
Popularity is another exogenous variable that can be used in demand forecasting. Generally,
location popularity is a collective perception: it emerges when a group of people holds a
shared view of a place. Different groups may view a location positively or negatively, but
the more attention a location gets, the more popular it will be.
Location popularity can be numerically represented by the number of reviews of
points of interest (POIs). If a POI receives numerous reviews from visitors, the
site is popular. So, we can use the number of POI reviews as a spatial
feature for taxi demand forecasting, i.e., the POIs with more reviews in a region
carry higher weight when predicting the demand in that zone.
In this thesis, we utilize the Yelp Fusion API to extract the number of historical
reviews at each point of interest. In Manhattan, for example, there are 11,760 POIs that
cover most of the businesses there. We then merge the review counts in each taxi zone to
obtain the average, median, maximum, minimum and standard deviation values to represent the
popularity of each zone. We assume that popularity is stationary during the analysis period,
which implies that popularity is a purely zonal feature.
Chapter 4
DEEP LEARNING MODELS
Deep learning approaches have been used to build accurate demand forecasting models.
Deep learning is essentially part of machine learning and is based on neural networks. The
network is often treated as a 'black box', with the focus on the input and output layers. Deep learning
models perform better on large datasets than traditional machine
learning approaches. In this chapter, we introduce four popular deep learning models:
convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term
memory networks (LSTM) and convolutional long short-term memory networks (Conv-
LSTM). We do not dig into the details and variations of each model; instead, we focus on
the basic version of each model and discuss its pros and cons.
4.1 Convolutional neural networks (CNN)
Convolutional neural networks (CNN) are a specialized type of neural network that has
proven effective in areas such as image recognition and classification. CNNs have been
successful in identifying faces, objects, and traffic signs, apart from powering vision in
robots and self-driving cars.
Central to the convolutional neural network is the convolutional layer that gives the
network its name. This layer performs an operation called a "convolution". In the context
of a convolutional neural network, convolution is a linear operation that involves the
multiplication of a set of weights with the input, much like a traditional neural network.
Given that the technique was designed for two-dimensional input, the multiplication is
performed between an array of input data and a two-dimensional array of weights, called a
filter or a kernel.
The filter is smaller than the input data and the type of multiplication applied
between a filter-sized patch of the input and the filter is a dot product. A dot product is an
element-wise multiplication between the filter-sized patch of the input and filter, which is
then summed, always resulting in a single value. Because it results in a single value, the
operation is often referred to as the "scalar product". Using a filter smaller than the input
is intentional as it allows the same filter (set of weights) to be multiplied by the input array
multiple times at different points on the input. Specifically, the filter is applied
systematically to each overlapping part or filter-sized patch of the input data, left to right,
top to bottom.
A typical CNN architecture is shown in Fig. 1. Unlike a fully connected neural
network, in which the hidden activation H is computed by multiplying the entire input V
by weights W, CNNs leverage convolution kernels to multiply a small local input (i.e., [v1,
v2, v3]) by the weights W. Then the kernel moves to the next local input (i.e., [v2, v3, v4]),
which means the kernel is fixed and the weights W are shared across the entire input V.
After computing the hidden units, a maxpooling layer with filters of a given pooling size
(e.g., 2 × 2) outputs the maximum activation in each filter, which means every MAX
operation discards 75% of the activations if the filter size is 2 × 2. Compared with fully
connected neural networks, CNNs progressively reduce the number of parameters and
help avoid overfitting.
Figure 1 Illustration of a typical CNN architecture
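The sliding dot product and the MAX operation described above can be sketched in NumPy; this is an illustrative toy, not the thesis code:

```python
# A minimal NumPy sketch of the two operations described above: a filter
# slid over the input as repeated dot products, followed by 2x2 max pooling.
import numpy as np

def conv2d(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid convolution: dot product of the kernel with each patch."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool2d(x: np.ndarray, size: int = 2) -> np.ndarray:
    """Keep the maximum of each non-overlapping size x size window."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
feat = conv2d(x, np.ones((3, 3)))   # shape (2, 2): the filter fits 4 times
pooled = maxpool2d(x)               # keeps 1 of every 4 activations (75% discarded)
```

Note how `maxpool2d` returns a 2 × 2 array from a 4 × 4 input, discarding three quarters of the activations exactly as stated above.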
4.2 Recurrent neural networks (RNN)
Recurrent neural networks are another specialized type of neural networks where
the output from the previous step are fed as input to the current step. In traditional neural
networks, all the inputs and outputs are independent, but in the case when predicting the
next word of a sentence, the previous words are required, and hence there is a need to
remember them. RNN solve this issue with the help of a Hidden Layer. The main and most
important feature of RNN is the Hidden state, which remembers some information about a
sequence.
An RNN has a "memory" which retains information about what has been
calculated so far. It uses the same parameters for each input, as it performs the same task on all
the inputs and hidden layers to produce the output. This reduces the number of
parameters, unlike other neural networks. Fig. 2 shows the basic RNN structure and its
unrolled version. At a particular timestep t, X(t) is the input to the network and h(t) is the
output of the network. A is the RNN cell, which contains neural networks just like a
feedforward net.
Figure 2 Illustration of a typical RNN architecture (adopted from [47])
First, the RNN takes X(0) from the input sequence and outputs h(0), which
together with X(1) is the input for the next step. Next, h(1) is the input
together with X(2) for the following step, and so on. With this recursive structure, the RNN keeps
remembering the context while training. We can define the values of the hidden units using Eq.
(1):
h_t = φ(W · X_t + U · h_{t-1})    (1)
where h_t is the hidden state at timestamp t, φ is the activation function (either tanh or
sigmoid), W is the weight matrix from the input to the hidden layer, X_t is the input
at timestamp t, U is the weight matrix from the hidden layer at time t-1 to the hidden layer at
time t, and h_{t-1} is the hidden state at timestamp t-1.
RNNs learn the weights U and W through training using back propagation. These
weights decide the importance of the hidden state of the previous timestamp and the
importance of the current input; essentially, they determine how much of the previous hidden
state and of the current input is used to generate the current output. The activation
function φ adds non-linearity to the RNN, allowing it to model non-linear relationships
in the data.
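Eq. (1) amounts to a single line of code per step. A minimal NumPy sketch of a few recurrent steps, with tanh chosen as the activation and the bias omitted as in Eq. (1):

```python
# A single RNN step following Eq. (1): h_t = tanh(W . x_t + U . h_{t-1}).
# The dimensions are illustrative.
import numpy as np

def rnn_step(x_t, h_prev, W, U):
    return np.tanh(W @ x_t + U @ h_prev)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))   # input-to-hidden weights
U = rng.normal(size=(3, 3))   # hidden-to-hidden weights
h = np.zeros(3)
for x_t in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    h = rnn_step(x_t, h, W, U)  # the same W, U are reused at every step
```

Reusing the same `W` and `U` at every step is exactly the parameter sharing described above.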
4.3 Long-short term memory networks (LSTM)
Long-short term memory networks (LSTM) are actually a special recurrent neural network
(RNN) architecture. Sometimes we only focus on recent information to perform the present
task. For example, we consider building a language model to predict the next word based
on previous ones in a sentence. If we want to know the last word in the sentence of “The
clouds are in the --.”, we do not need much information but the word ‘clouds’. It is pretty
obvious that the last word is going to be ‘sky’. In such cases, RNNs can deal with such
prediction tasks very well.
Unfortunately, RNNs are not always capable of handling long-term dependencies,
which LSTMs handle well. Consider trying to predict the last word in the text "I grew up
in China……I speak fluent --." The recent context indicates that the last word is probably
the name of a language. But we need the context of 'China' from further back
to predict 'Chinese' as the last word. LSTMs are explicitly designed to handle
this long-term dependency problem.
LSTMs also have this chain-like structure, but the LSTM cell is quite special. In
standard RNNs, each cell has only a single neural network layer, while there are four in
each LSTM cell (as shown in Fig. 3). As demonstrated in Eqs. (2)-(7), f_t, i_t, C_t and o_t represent
the forget gate, input gate, memory cell, and output gate, respectively, sharing the same
dimension as h_t.
Figure 3 Illustration of a standard LSTM cell (adopted from [48])
f_t = σ(W_f · [h_{t-1}, x_t] + W_cf ⊙ C_{t-1} + b_f)    (2)

i_t = σ(W_i · [h_{t-1}, x_t] + W_ci ⊙ C_{t-1} + b_i)    (3)

C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)    (4)

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t    (5)

o_t = σ(W_o · [h_{t-1}, x_t] + W_co ⊙ C_{t-1} + b_o)    (6)

h_t = o_t ⊙ tanh(C_t)    (7)
W_f, W_i, W_c, W_o, W_cf, W_ci, W_co are weight matrices which conduct a linear
transformation, while b_f, b_i, b_c, b_o are bias parameters. It is noteworthy that C_t
acts as an accumulator of the state information. Every time a new
input arrives, its information is accumulated into the memory cell C_t once the input
gate i_t is activated. Also, the past cell state C_{t-1} can be forgotten if the forget gate f_t is
on. Whether the latest cell state C_t is propagated to the final state is further
determined by the output gate o_t. The operator ⊙ denotes the Hadamard product, i.e.,
element-wise multiplication. σ and tanh are two non-linear
activation functions given by:
σ(x) = 1 / (1 + e^{-x})    (8)

tanh(x) = (e^x − e^{-x}) / (e^x + e^{-x})    (9)
LSTMs are more effective than standard RNNs for taxi demand forecasting
because taxi demand exhibits strong patterns across the time of day and the day of the
week. For example, taxi demand has a morning peak and an evening peak during the
day. If the model remembers the morning peak from a previous day, we
may get a better prediction outcome for the next morning peak, even if we only have inputs
of historical records from the previous hour.
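Eqs. (2)-(7) translate directly into code. A NumPy sketch of one step of this (peephole-style) LSTM cell, with illustrative dimensions and randomly initialized parameters:

```python
# A direct NumPy transcription of Eqs. (2)-(7); dimensions are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    z = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    f = sigmoid(p["Wf"] @ z + p["Wcf"] * C_prev + p["bf"])   # Eq. (2)
    i = sigmoid(p["Wi"] @ z + p["Wci"] * C_prev + p["bi"])   # Eq. (3)
    C_tilde = np.tanh(p["Wc"] @ z + p["bc"])                 # Eq. (4)
    C = f * C_prev + i * C_tilde                             # Eq. (5)
    o = sigmoid(p["Wo"] @ z + p["Wco"] * C_prev + p["bo"])   # Eq. (6)
    h = o * np.tanh(C)                                       # Eq. (7)
    return h, C

n_h, n_x = 4, 3
rng = np.random.default_rng(1)
p = {k: rng.normal(scale=0.1, size=(n_h, n_h + n_x))
     for k in ("Wf", "Wi", "Wc", "Wo")}
p.update({k: rng.normal(scale=0.1, size=n_h)
          for k in ("Wcf", "Wci", "Wco", "bf", "bi", "bc", "bo")})
h, C = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h), p)
```

The Hadamard products `f * C_prev` and `i * C_tilde` are the ⊙ operations of Eq. (5); the cell state `C` carries information forward across steps.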
4.4 Convolutional long-short term memory networks (Conv-LSTM)
Although the LSTM layer has been proven to be quite suitable to handle data with temporal
characteristics, it lacks the ability to extract spatial information. To address this problem,
we introduce the convolutional long-short term memory network (Conv-LSTM), which is
an extension of LSTM.
In order to capture spatial information, all elements in the Conv-LSTM, including
the input, output, hidden state, memory cell, input gate, output gate, and forget gate, are
resized to 3D tensors whose last two dimensions (i.e., height and width) represent the
spatial characteristics. We assume that the input is an "image" at each timestep. The
spatiotemporal information of the image then flows through the Conv-LSTM cells. The future output is
determined by both the outputs and inputs of previous timesteps. This is achieved
by applying the convolution operator in the state-to-state and input-to-state transitions. So,
we just need to replace the matrix product operator ("·") in the LSTM with the convolution
operator ("∗"). The key equations are given in Eqs. (10)-(15) below:
f_t = σ(W_f ∗ [h_{t-1}, x_t] + W_cf ⊙ C_{t-1} + b_f)    (10)

i_t = σ(W_i ∗ [h_{t-1}, x_t] + W_ci ⊙ C_{t-1} + b_i)    (11)

C̃_t = tanh(W_c ∗ [h_{t-1}, x_t] + b_c)    (12)

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t    (13)

o_t = σ(W_o ∗ [h_{t-1}, x_t] + W_co ⊙ C_{t-1} + b_o)    (14)

h_t = o_t ⊙ tanh(C_t)    (15)
The forget gate, input gate, memory cell, output gate, hidden state, and input tensors
at timestep t are denoted as f_t, i_t, C_t, o_t, h_t, x_t ∈ R^{P×H×W}, respectively,
where P is the number of feature maps and H and W (i.e., height and width) stand for the spatial dimensions.
All the weight tensors, including W_f, W_i, W_c, W_o, W_cf, W_ci, W_co, are fixed for
each convolution kernel, which implies that the weights are shared as the kernel moves.
We can extract different spatial features (e.g., congestion and crowds) by using multiple
kernels. To ensure that the states have the same height and width as the inputs, we
pad the missing values with zeros when the kernel reaches the boundary, a technique called
zero-padding.
For a large-scale area, taxi demand depends not only on temporal features
but also on spatial characteristics; e.g., the central business district (CBD) of a city tends
to have more taxi demand. Conv-LSTM is an appropriate model for such
spatiotemporal datasets.
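As a sketch of Eq. (10), the forget gate of a Conv-LSTM can be computed by replacing the matrix products with 2-D convolutions. Splitting the single kernel applied to the concatenated [h_{t-1}, x_t] into separate input-to-state and state-to-state kernels is an equivalent, illustrative choice; all values below are toy data:

```python
# Sketch of the Conv-LSTM forget gate, Eq. (10): the matrix product of the
# LSTM is replaced by a 2-D "same" convolution so spatial structure is kept.
import numpy as np

def conv2d_same(x, kernel):
    """'Same' convolution with zero-padding, as described in the text."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H, W = 4, 4
x_t = np.random.default_rng(2).random((H, W))  # input "image" at step t
h_prev = np.zeros((H, W))                      # previous hidden state
C_prev = np.zeros((H, W))                      # previous cell state
Wxf = np.full((3, 3), 0.1)                     # input-to-state kernel
Whf = np.full((3, 3), 0.1)                     # state-to-state kernel
Wcf = np.full((H, W), 0.1)                     # peephole weights (Hadamard)
b_f = 0.0
# Eq. (10): f_t = sigma(W_f * [h_{t-1}, x_t] + W_cf (.) C_{t-1} + b_f)
f_t = sigmoid(conv2d_same(x_t, Wxf) + conv2d_same(h_prev, Whf)
              + Wcf * C_prev + b_f)
```

Because of the zero-padding, `f_t` keeps the same H × W spatial shape as the input, as required for the state tensors.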
4.5 Summary
In this chapter, we discussed four typical deep learning models that can be adopted in taxi
demand forecasting, including CNN, RNN, LSTM and Conv-LSTM. We presented the
basic version of each model and analyzed their pros and cons, as shown in Table 2.
Table 2 Comparison of basic deep learning models

Model      | Pros                                   | Cons
-----------|----------------------------------------|------------------------------------------
RNN        | Time series analysis                   | Costly computation time; not capable of
           |                                        | handling long-term dependency
CNN        | Spatial information extraction;        | Not widely used in time series analysis
           | efficient training process             |
LSTM       | Time series analysis; capable of       | Not capable of extracting spatial
           | handling long-term dependency          | information; costly computation time
Conv-LSTM  | Time series analysis; capable of       | Costly computation time
           | handling long-term dependency;         |
           | spatiotemporal information extraction  |
However, a single deep learning model cannot capture multiple features from
different sources. In such cases, we need to combine those models so that we can
incorporate multiple information sources and obtain accurate predictions. In the next chapter,
the proposed model is presented. It is an adaptive demand forecasting model that fuses
multiple models to build a better estimator.
Chapter 5
METHODOLOGY
As introduced in Chapter 4, each deep learning approach has both advantages and
limitations. In other words, each basic deep learning model suits a certain
situation; e.g., LSTM is more effective at dealing with time-dependent data, while CNN is
good at extracting spatial information.
Generally, taxi demand patterns can be complex, with various relationships between
demand and exogenous data. In such cases, it is hard to achieve optimal prediction
accuracy even if we select the most appropriate model from the candidates
above. In this chapter, we propose a novel deep learning structure which combines several
basic approaches into a stronger one. First, we use different models to extract different
information (i.e., demand, weather and popularity). Then, we fuse the encoded features using an
ensemble method. Finally, we train the model to learn the best combination of weights.
5.1 Preliminaries
The short-term taxi demand forecasting problem is inherently a time series prediction
problem, which implies that we can use taxi demand from previous time periods as valuable
information for future prediction. This thesis focuses on the prediction of taxi pick-ups in
the next 15 minutes. Conceptually, the number of drop offs in previous periods could
influence taxi pick-up demand, since people may potentially take a taxi to a place for a
particular activity (e.g., concert, party, work, etc.), and then return by using a taxi again. In
addition, weather conditions and location popularity could also impact the demand
generation. Here, the definitions and notations of the variables used in this thesis are
described:
5.1.1 Taxi zones and time partition
Taxi pick-up demand is both time-dependent and zonal-based. First, we need to define the
temporal and spatial characteristics, which are time interval and taxi zones, respectively.
The urban area is partitioned into small grids with an irregular shape, where each
grid represents a taxi zone. The driver can pick up passengers in one zone and drop them
off in another zone. Then, each taxi zone has both pick-up and drop-off records. For the
time interval, we aggregate variables in every 15 minutes, which implies the time interval
is 15 minutes. In this thesis, we focus on predicting pick-up demand in the next 15 minutes
using data from the previous hour. In other words, the inputs for pick-up and drop-off
requests will be the data in four intervals and output has one interval.
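The four-intervals-in, one-interval-out scheme corresponds to a standard sliding-window sample construction; a NumPy sketch with illustrative array shapes:

```python
# Sketch of the sliding-window sample construction: four 15-minute input
# intervals predict the next one, assuming `demand` is a (T, zones) array.
import numpy as np

def make_samples(demand: np.ndarray, k: int = 4):
    X = np.stack([demand[t - k:t] for t in range(k, len(demand))])
    y = demand[k:]
    return X, y   # X: (samples, k, zones), y: (samples, zones)

demand = np.arange(20).reshape(10, 2)   # 10 intervals, 2 zones (toy data)
X, y = make_samples(demand)
# X[0] holds intervals 0-3; y[0] is interval 4.
```

Each sample pairs one hour of history (`k = 4` intervals) with the demand of the following interval.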
5.1.2 Pick-up and drop-off demand
As discussed above, pick-up and drop-off demand have the same dimensions. So, we can
regard them as a two-dimension variable. The number of requests is aggregated in each
time interval (i.e. 15 minutes) and taxi zone.
The pick-up demand at the tth time slot (e.g., 15 min) in zone i is defined as the
number of pickups during this time interval within the zone, denoted by pt,i. The pick-up
demand for all the zones in each time interval is defined as matrix 𝑃", where the ith element
is (𝑃")i = pt,i. A similar definition for drop-offs where dt,i denotes the number of drop-off
records at the tth time slot (e.g., 15 min) in zone i. Again, the drop-off records for all the
zones in each time interval is kept in matrix 𝐷", where the ith element is (𝐷")i = dt,i.
5.1.3 Weather information
Weather is another input to our model. It contains several features which have an impact
on human mobility and activity behavior. For instance, extreme weather conditions
such as blizzards and dense fog directly influence the travel choices of both passengers
and drivers.
In this thesis, we consider 7 weather variables: maximum
temperature (in degrees Fahrenheit), minimum temperature (degrees Fahrenheit), precipitation
(millimeters), average wind speed (meters per second), snowfall (inches), smoke or haze
(dummy variable), and heavy fog or heavy freezing fog (dummy variable).
All of the aforementioned variables take hourly values (i.e., variables are
averaged per hour). We assume that weather variables only have temporal dependencies
(i.e., the variables take the same values across zones).
We denote these variables in different ways since we have both numerical and
categorical variables. The maximum temperature, minimum temperature, precipitation,
average wind speed and snowfall at time slot t are denoted by w_{t,tmax}, w_{t,tmin}, w_{t,prcp}, w_{t,awnd},
w_{t,snow}, respectively.
We introduce the dummy variable w_{t,skhz} to characterize the presence of smoke or haze,
given by:

w_{t,skhz} = 1, if smoke or haze occurs during the t-th time slot; 0, otherwise

We also denote by w_{t,hfog} another dummy variable for heavy fog:

w_{t,hfog} = 1, if heavy fog occurs during the t-th time slot; 0, otherwise
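Assembling the hourly weather vector can be sketched as follows; the condition strings passed in are hypothetical placeholders for the raw NOAA fields:

```python
# Sketch of assembling the hourly weather vector p_t of Section 5.1.3.
# The two dummies follow the definitions of w_{t,skhz} and w_{t,hfog} above.
def weather_vector(tmax, tmin, prcp, awnd, snow, conditions):
    """conditions: set of condition strings reported for the hour."""
    w_skhz = 1 if conditions & {"smoke", "haze"} else 0
    w_hfog = 1 if conditions & {"heavy fog", "heavy freezing fog"} else 0
    return [tmax, tmin, prcp, awnd, snow, w_skhz, w_hfog]

p_t = weather_vector(41.0, 28.0, 0.0, 3.2, 0.0, {"haze"})
```

The resulting 7-element vector is what the LSTM branch of Section 5.2.2 consumes at each timestep.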
5.1.4 Popularity
Popularity is a novel variable that we adopt in our demand forecasting problem. It reflects
the importance of a specific spot or region. In other words, it keeps spatial information of
different weights for different spots or regions. Popularity can be represented by various
features. In this thesis, POIs are used to measure location popularity.
A point of interest (POI) represents a specific point location that people may find
useful and interesting. So, POIs in a region inform the popularity of this region. We assume
that popularity is a zonal-based attribute. The popularity is defined as the number of
reviews of POI within the zone. The average review number, median review number,
minimum review number, maximum review number and standard deviation in zone i are
defined as the average value, median value, minimum value, maximum value and standard
deviation of location popularity, denoted by 𝑟<=;( , 𝑟>,?( , 𝑟>(@( , 𝑟><A( and 𝑟7"?( , respectively.
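The five zonal statistics can be computed with a pandas group-by; the toy review counts below are illustrative:

```python
# Sketch of computing the five zonal popularity statistics from POI review
# counts, assuming a DataFrame with hypothetical columns "zone" and "reviews".
import pandas as pd

pois = pd.DataFrame({
    "zone":    [1, 1, 1, 2, 2],
    "reviews": [10, 30, 50, 5, 15],
})
popularity = (pois.groupby("zone")["reviews"]
                  .agg(["mean", "median", "min", "max", "std"]))
# Each row of `popularity` is the (r_avg, r_med, r_min, r_max, r_std) of a zone.
```

Under the stationarity assumption above, this table is computed once and reused for every time interval.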
5.1.5 Problem formulation
The prediction problem is to predict the pick-up demand for the next time interval (15 min) in
each taxi zone using pick-up, drop-off, weather and location popularity information. It
can be formulated as:

P̂_{t+1} = f(P_s | s = t, t−1, …, t−m; D_s | s = t, t−1, …, t−m; R^z | z = 1, 2, …, 63; W_s | s = t, t−1, …, t−m)    (16)

where f(·) denotes the mapping learned by the model, P̂_{t+1} is the taxi pick-up demand
prediction for the next 15 minutes, m is the look-back time window, and P, D, R, W
represent pick-up demand, drop-off demand, popularity and weather, respectively.
5.2 Multi-source information based spatiotemporal neural network (MSI-STNN)
This section presents the proposed deep learning architecture, i.e., MSI-STNN, to integrate
spatiotemporal variables (i.e., pick-up and drop-off records) and exogenous information
(i.e., weather and popularity) for short-term taxi demand forecasting. Specifically, the
method is composed of three deep learning models: convolutional long short-term memory
(Conv-LSTM), long short-term memory (LSTM) and convolutional neural network (CNN),
which are utilized to capture different characteristics. We subsequently concatenate these
extracted features to obtain the prediction outcome.
We propose the multi-source information based spatiotemporal neural network
(MSI-STNN) model to integrate data from multiple sources into our deep learning
architecture. The structure of MSI-STNN is shown in Fig. 4. The stacked Conv-LSTM
layers are used to handle the spatiotemporal variables (i.e., pick-up and drop-off demand).
The LSTM layers are implemented to deal with the temporal variables (i.e., weather), while
the CNN layers are adopted to extract the popularity features in each zone. The encoded
information from the different sources is concatenated, and two fully connected layers (i.e.,
dense layer) are used as the decoder.
Figure 4 Framework of the proposed MSI-STNN model
5.2.1 Structure of spatial variables
The points of interest (POIs) in the study area constitute the spatial variable, since the
number of POI reviews reflects the popularity of a zone. A CNN architecture is applied to
extract information about location popularity. The formulation is given below:

r^i = (r^i_avg, r^i_med, r^i_min, r^i_max, r^i_std)    (17)

R = (R^1, …, R^i) = F_2^maxpool(F_2^conv(F_1^maxpool(F_1^conv(r^1, …, r^i))))    (18)

V̂^r = σ(w_R ∗ R + b_R)    (19)

where r^i_avg, r^i_med, r^i_min, r^i_max, r^i_std are the average, median, minimum, maximum and
standard deviation values of the popularity in zone i, respectively, F_l^conv and F_l^maxpool
denote the l-th convolution and max-pooling layers, and w_R, b_R are the weights and bias.
5.2.2 Structure of temporal variables
Weather conditions do not vary across zones, which implies that weather is a non-
spatial variable. But it changes as time goes by, which makes it a temporal variable.
Changes in weather conditions, especially extreme conditions (e.g.,
snowstorms and rainstorms), affect the pick-up demand. A sequence of vectors, defined
as p_t = (w_{t,tmax}, w_{t,tmin}, w_{t,prcp}, w_{t,awnd}, w_{t,snow}, w_{t,skhz}, w_{t,hfog}), is fed into the stacked LSTM
architecture and transformed into the encoded weather information V̂^w_t:

(P^{Lw}_{t−k}, …, P^{Lw}_{t−1}) = F_{Lw}^{LSTM}(… F_1^{LSTM}(p_{t−k}, …, p_{t−1}))    (20)

V̂^w_t = σ(w_e · P^{Lw}_{t−1} + b_e)    (21)

where k is the look-back time window, Lw is the last layer of the stacked LSTM
architecture, and P^{Lw}_{t−k}, …, P^{Lw}_{t−1} is the output of that last layer. w_e, b_e are the
weights and bias.
5.2.3 Structure of spatiotemporal variables
The spatiotemporal variables (i.e., pick-up and drop-off demand) share the same training
structure, but they are handled separately. Each branch is composed of stacked Conv-LSTM
layers, one batch normalization layer (BN) and one dropout layer (DO). The formulation
of the architecture for the spatiotemporal variables is given in Eqs. (22)-(27):

h_t = (p_{t,1}, …, p_{t,i})    (22)

d_t = (d_{t,1}, …, d_{t,i})    (23)

(H^{Lp}_{t−k}, …, H^{Lp}_{t−1}) = F_{Lp}^{DO}(F_{Lp}^{BN}(F_{Lp}^{ConvLSTM}(… F_1^{ConvLSTM}(h_{t−k}, …, h_{t−1}))))    (24)

V̂^p_t = σ(w_H · H^{Lp}_{t−1} + b_H)    (25)

(D^{Ld}_{t−k}, …, D^{Ld}_{t−1}) = F_{Ld}^{DO}(F_{Ld}^{BN}(F_{Ld}^{ConvLSTM}(… F_1^{ConvLSTM}(d_{t−k}, …, d_{t−1}))))    (26)

V̂^d_t = σ(w_D · D^{Ld}_{t−1} + b_D)    (27)

where (h_{t−k}, …, h_{t−1}) and (d_{t−k}, …, d_{t−1}) are the input vectors for pick-up and drop-off
demand, respectively, and V̂^p_t and V̂^d_t are the corresponding encoded output vectors. k is the
look-back time window, i is the number of zones, and Lp, Ld are the last Conv-LSTM layers
for pick-up and drop-off demand, respectively. w_H, w_D, b_H, b_D are the weight and bias
parameters, and σ is the sigmoid function defined in Eq. (8).
5.2.4 Information fusion
The encoded information from the different sources is concatenated and then decoded by
two dense layers. The predicted pick-up demand at time t is:

V̂_t = (V̂^p_t, V̂^d_t, V̂^w_t, V̂^r)    (28)

V̂_{t,ultimate} = F_2^dense(F_1^dense(V̂_t))    (29)

where V̂_t is the concatenated vector and V̂_{t,ultimate} is the ultimate prediction at time t.
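The fusion step of Eqs. (28)-(29) is a concatenation followed by two dense layers; a NumPy shape-level sketch with illustrative layer sizes (63 outputs, one per taxi zone):

```python
# A NumPy sketch of the fusion step, Eqs. (28)-(29): the four encoded
# vectors are concatenated and passed through two dense layers.
import numpy as np

def dense(x, W, b, act):
    return act(W @ x + b)

relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z

rng = np.random.default_rng(3)
v_p, v_d = rng.random(8), rng.random(8)   # encoded pick-up / drop-off vectors
v_w, v_r = rng.random(4), rng.random(4)   # encoded weather / popularity vectors
v_t = np.concatenate([v_p, v_d, v_w, v_r])        # Eq. (28), length 24
W1, b1 = rng.normal(size=(16, 24)), np.zeros(16)
W2, b2 = rng.normal(size=(63, 16)), np.zeros(63)  # one output per taxi zone
v_ultimate = dense(dense(v_t, W1, b1, relu), W2, b2, identity)  # Eq. (29)
```

The vector lengths here are placeholders; in the actual model they are determined by the encoder branches.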
5.2.5 Objective function
For the training process, the mean squared error (MSE) between the prediction and the actual
demand is used as the loss function, given by:

loss = (1/n) Σ_{t=1}^{n} (V̂_{t,ultimate} − V_{t,actual})²    (30)

where n is the number of timesteps. The error is reduced by learning the weights and
biases through back propagation. The training process is illustrated in
Algorithm 1:
Algorithm 1. MSI-STNN training
Input Pick-up demand observations {P1,…,Pn} in training set
Drop-off demand observations {D1,…,Dn} in training set
Weather observations {W1,…,Wn} in training set
Location popularity observations {R1,…,Rn} in training set
Input time step: k + 1
Input zone id: i
Output MSI-STNN with learnt parameters
Procedure MSI-STNN train
1: Initialize a null set: V ← ∅
2: for all available time intervals t (k + 1 ≤ t ≤ n) do
3: 𝔙^h_t = [h_{t−k}, …, h_{t−1}]
4: 𝔙^d_t = [d_{t−k}, …, d_{t−1}]
5: 𝔙^w_t = [p_{t−k}, …, p_{t−1}]
6: 𝔙^r = [r^1, …, r^i]
where p_t = (w_{t,tmax}, w_{t,tmin}, w_{t,prcp}, w_{t,awnd}, w_{t,snow}, w_{t,skhz}, w_{t,hfog}),
r^i = (r^i_avg, r^i_med, r^i_min, r^i_max, r^i_std), and 𝔙^h_t, 𝔙^d_t, 𝔙^w_t, 𝔙^r are the input sets of
the different categories of explanatory variables in one sample.
7: put the training sample (𝔙^h_t, 𝔙^d_t, 𝔙^w_t, 𝔙^r) into V
8: end for
9: Initialize all the weight and bias parameters
10: repeat
11: Randomly extract a batch of samples V_b from V
12: Update the parameters by minimizing the objective function shown in Eq. (30)
within V_b
13: until convergence criterion met
14: end procedure
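Algorithm 1's batch-sampling loop can be illustrated in miniature; here a plain linear model stands in for MSI-STNN purely for illustration, and the parameters are updated by gradient descent on the MSE loss of Eq. (30):

```python
# A minimal stand-in for Algorithm 1's training loop: samples are drawn in
# random batches and parameters are updated by gradient descent on the MSE
# loss of Eq. (30). A linear model replaces MSI-STNN for illustration only.
import numpy as np

rng = np.random.default_rng(4)
X = rng.random((200, 5))                  # stand-in for the input sets 𝔙
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w                            # stand-in for observed pick-ups
w = np.zeros(5)                           # step 9: initialize parameters
for epoch in range(300):                  # step 10: repeat until convergence
    idx = rng.permutation(len(X))
    for batch in np.array_split(idx, 10):             # step 11: random batch V_b
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # step 12: d(MSE)/dw
        w -= 0.1 * grad
mse = np.mean((X @ w - y) ** 2)           # Eq. (30) on the full set
```

In the real model the gradient step is performed by the deep learning framework's optimizer rather than this hand-written update.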
Chapter 6
CASE STUDY
In the previous chapter, the preliminaries of the different variables and our model structure
were presented. But how do they perform on real-world problems? This chapter
introduces a case study using taxi data from Manhattan, NY, to test and analyze the
performance of our model. As one of the most congested and busiest regions in the world,
Manhattan has a remarkably rich history of taxi services. A dataset of one year of
Manhattan taxi trips is used for training, and a three-month dataset is used for testing. In
Section 6.3, we perform a sensitivity analysis (i.e., hyperparameter fine-tuning) to find the
optimal model structure. An analysis of the results is presented in
Section 6.4. We conclude the chapter with visualizations of the prediction outcomes from
both spatial and temporal views.
6.1 Study site and dataset
The study site is Manhattan, NY, which is initially partitioned into 69 taxi
zones, as shown in Fig. 5. The zones with IDs 103, 104, 105, 153, 194 and 202 are not
considered in this thesis, since the taxi demand there is mostly zero. We exclude them to
avoid sparsity problems in the input matrix: if those zones were included,
most cells would be zero, which strongly affects the weight updates during
training and can make the predictions inaccurate.
Figure 5 Taxi zones in Manhattan
The taxi pick-up and drop-off requests are extracted from the NYC Taxi & Limousine
Commission (TLC) for the period from January 1st, 2017 to March 31st, 2018.
TLC partitions the Manhattan area into the taxi zones
described above. The dataset contains about 200,000 yellow taxi requests
per day in total, and each record includes the pick-up time, drop-off time, pick-up location
ID, and drop-off location ID. Each time record is a timestamp, while each location ID
corresponds to a specific taxi zone (e.g., if the location ID is 1, the record belongs to taxi
zone 1).
Both the pick-up and drop-off datasets are partitioned into a training set comprising
requests between January 1st, 2017 and December 31st, 2017, and a testing set containing
the remaining observations from January 1st, 2018 to March 31st, 2018. Fig. 6 shows the daily
pick-up records in January 2018. It is clear that the pick-up demand on January 4th,
2018 is extremely low; the reason is that there was a snowstorm that day, which
had a great impact on transportation. Fig. 7 shows the spatial distribution of pick-up
demand on January 26th, 2018, between 6 p.m. and 7 p.m., from which we can see that
most of the pick-up requests are concentrated in the middle part of Manhattan. This
spatiotemporal characteristic is a great challenge for short-term taxi demand forecasting,
and it motivates us to introduce the popularity and weather datasets to obtain a better
prediction.
Figure 6 Daily pick-up demand in January 2018
Figure 7 Spatial distribution of pick-up demand on January 26th, 2018
between 6 p.m. and 7 p.m.
6.2 Performance metrics
In this section, we introduce two metrics to evaluate model performance: the mean
squared error (MSE) and Theil's inequality coefficient (TIC) [46]. MSE is
a standard measure in statistics, especially for regression problems. TIC was first
used in the business domain (e.g., economic forecasts) and was later introduced to engineering.
6.2.1 Mean squared error (MSE)
The mean squared error (MSE) measures how close the predictions are to the observations.
It takes the differences between predicted and observed values (the "errors")
and squares them. The squaring removes negative signs
and gives more weight to larger differences. It is formulated as:

MSE = (1/N) Σ_{i=1}^{N} (Y_i − Ŷ_i)²    (31)
6.2.2 Theil’s inequality coefficient (TIC)
Theil's inequality coefficient (TIC), also known as Theil's U, measures how
well a time series of predicted values matches a corresponding time series of observed
values. Theil's U is calculated as:

U = sqrt((1/n) Σ_{t=1}^{n} (Y^pred_t − Y^obs_t)²) / [ sqrt((1/n) Σ_{t=1}^{n} (Y^pred_t)²) + sqrt((1/n) Σ_{t=1}^{n} (Y^obs_t)²) ]    (32)

where 0 ≤ U ≤ 1, Y^pred_t is the prediction outcome, and Y^obs_t is the actual observation.
The model performs well if U is close to 0 (i.e., the error between prediction and observation
is small).
U can be decomposed into bias ($U^b$), variance ($U^v$) and covariance ($U^c$) components, given by:

$U^b = \dfrac{\left(\bar{Y}^{pred} - \bar{Y}^{obs}\right)^2}{\frac{1}{N}\sum_{t=1}^{N}\left(Y_t^{pred} - Y_t^{obs}\right)^2}$ (33)

$U^v = \dfrac{\left(S^{pred} - S^{obs}\right)^2}{\frac{1}{N}\sum_{t=1}^{N}\left(Y_t^{pred} - Y_t^{obs}\right)^2}$ (34)

$U^c = \dfrac{2\left(1-\rho\right)S^{pred}S^{obs}}{\frac{1}{N}\sum_{t=1}^{N}\left(Y_t^{pred} - Y_t^{obs}\right)^2}$ (35)

where $\bar{Y}^{pred}$, $\bar{Y}^{obs}$, $S^{pred}$, $S^{obs}$ are the means and standard deviations of the predicted and observed measurements, respectively, and $\rho$ is the correlation coefficient between the two sets of measurements. The three components satisfy the relationship:

$U^b + U^v + U^c = 1$ (36)
The bias proportion reflects systematic error, and the variance proportion indicates how well the model prediction replicates the variability of the actual data; both should be as small as possible (less than 0.2). The covariance proportion measures the remaining, unsystematic error and should therefore be close to one. Generally, if the bias and variance proportions are small, the covariance proportion will accordingly be close to one.
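For illustration, Theil's U and its three proportions can be computed with NumPy. The function and toy numbers below are assumptions for demonstration, not part of the thesis experiments; note the population standard deviation (ddof=0) is required for the proportions to sum exactly to one:

```python
import numpy as np

def theil_decomposition(y_pred, y_obs):
    """Theil's U (eq. 32) and its bias/variance/covariance proportions (eqs. 33-35)."""
    yp, yo = np.asarray(y_pred, float), np.asarray(y_obs, float)
    d = np.mean((yp - yo) ** 2)                       # mean squared error (denominator)
    u = np.sqrt(d) / (np.sqrt(np.mean(yp ** 2)) + np.sqrt(np.mean(yo ** 2)))
    sp, so = yp.std(), yo.std()                       # population std (ddof=0)
    rho = np.corrcoef(yp, yo)[0, 1]                   # correlation coefficient
    u_bias = (yp.mean() - yo.mean()) ** 2 / d         # systematic error
    u_var  = (sp - so) ** 2 / d                       # unmatched variability
    u_cov  = 2 * (1 - rho) * sp * so / d              # remaining error
    return u, u_bias, u_var, u_cov

# Toy predicted vs. observed demand series
u, ub, uv, uc = theil_decomposition([11, 10, 9, 14], [10, 12, 9, 15])
print(round(ub + uv + uc, 6))  # -> 1.0, i.e. the proportions sum to one
```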
6.3 Sensitivity analysis
With the performance metrics in place, we first conduct a sensitivity analysis aimed at fine-tuning the hyperparameters. The objective is to find the hyperparameter combination that yields the highest accuracy. We follow a sequential process, changing one hyperparameter at a time: once the optimal value of one hyperparameter is found, it is fixed and we move on to the next. In this section, we focus on the three hyperparameters that matter most in our model structure: the number of epochs, the optimizer, and the batch size.
6.3.1 Epoch
An epoch is one complete pass through the entire training dataset, i.e., one forward pass and one backward pass over all training samples. As the number of training epochs increases, the validation error of the model (MSE in this thesis) should decrease accordingly; this is the learning process of the model.
First, to investigate the impact of the number of training epochs on predictive performance, the number of epochs is increased from 0 to 100 and the validation error (MSE) after each training epoch is recorded for the proposed model. From Fig. 8 (initial training, without any modification of the model structure), we can see that the validation error of the MSI-STNN model decreases slowly over the first 60 epochs and then drops sharply to around 130. This low learning speed is expected, since the model structure is complicated (four parallel submodules) and the dataset is large (a one-year training set and a three-month testing set). The computational time is around 100 seconds per epoch. The MSE decreases only slightly between epoch 60 and epoch 100, from around 130 to 118, while the computation time increases by 4,000 seconds. Considering the trade-off between accuracy and computation time, the best number of training epochs for this model is around 60, which sacrifices a little predictive performance for better computational performance.
Figure 8 Initial training process with 100 epochs
Based on these results, we modified the structure of the model (as shown in Fig. 9), since the initial configuration took too long to converge. In the new model structure, we removed the batch normalization and dropout layers between the Conv-LSTM layers, which makes the model less complicated and faster to converge.
Figure 9 Framework of the modified MSI-STNN model
6.3.2 Optimizer
Optimizers are algorithms used to adjust the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss. The choice of optimizer thus determines how the weights and learning rates are updated, and it is responsible for both reducing the loss and producing the most accurate results possible.
There are various optimizers used in deep learning. In this thesis, we consider three: Stochastic Gradient Descent (SGD), Adagrad, and Adaptive Moment Estimation (Adam). The pros and cons of each optimizer are summarized in Table 2.

Table 2 Comparison of various optimizers
SGD. Pros: converges in less time; requires less memory. Cons: high variance; slow reduction of the learning rate.
Adagrad. Pros: automatically adapted learning rate; able to train on sparse data. Cons: expensive computation; the decreasing learning rate slows training.
Adam. Pros: converges rapidly; rectifies the vanishing learning rate and high variance. Cons: expensive computation.
SGD is a variant of Gradient Descent that updates the model's parameters more frequently: the parameters are adjusted after computing the loss on each training example. So, if the dataset contains 1,000 samples, SGD updates the model parameters 1,000 times per epoch instead of once, as in batch Gradient Descent. Because the parameters are updated so frequently, they exhibit high variance and fluctuations.
One disadvantage of SGD is that its learning rate is constant for all parameters and for every epoch. The learning rate of the Adagrad optimizer, in contrast, is adaptive: Adagrad changes the learning rate η for each parameter and at every time step. It is a first-order optimization algorithm that works with the derivative of the error function, making large updates for infrequently updated parameters and small steps for frequent ones.
Adam is an adaptive learning rate optimization algorithm designed specifically for training deep neural networks. It leverages the power of adaptive learning rate methods to find an individual learning rate for each parameter. It combines the advantages of Adagrad, which works well with sparse gradients but struggles in the non-convex optimization of neural networks, and RMSprop, which works well in online settings. Adam's popularity has grown rapidly in recent years.
Fig. 10 shows the training process using the different optimizers. Here we focus mainly on computation time, since our dataset is large and the primary objective is an efficient training process. The model converges much more rapidly with the Adam optimizer than with the other two. Moreover, both SGD and Adagrad exhibit two decays: they converge during the first decay, while the overfitting problem shows up in the second. Overall, Adam is a better optimizer for our model than SGD and Adagrad, and it is also the optimizer adopted in our initial model structure (as shown in Fig. 4).
Figure 10 Training process using different optimizers
6.3.3 Batch size
The batch size is a hyperparameter that defines the number of samples processed before the internal model parameters are updated. Think of a batch as a for-loop iterating over one or more samples and making predictions; at the end of the batch, the predictions are compared to the expected outputs and the error is computed. From this error, the update algorithm improves the model. A training dataset can be divided into one or more batches; the smaller the batch size, the more batches per epoch.
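The batching logic described above can be sketched as a simple generator (the array names and sizes below are illustrative; the real model consumes demand tensors):

```python
import numpy as np

def iterate_batches(X, y, batch_size):
    """Yield successive (X, y) mini-batches; the last batch may be smaller."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

X = np.arange(10).reshape(10, 1)   # 10 toy samples, one feature each
y = np.arange(10)
sizes = [len(xb) for xb, yb in iterate_batches(X, y, batch_size=4)]
print(sizes)  # -> [4, 4, 2]: three parameter updates per epoch
```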
In this thesis, we compare the training processes and performance (MSE and TIC values) for batch sizes of 32, 64, 128 and 256. As shown in Fig. 11, the model converges rapidly for all batch sizes (since we adopt Adam as the optimizer), but on the test set the batch size of 32 shows some volatility early on. This makes sense: a small batch size means more batches per epoch, the many noisier updates consume more computation time, and the direction of the loss decay becomes less certain. In other words, the training process is more stable with larger batch sizes.
Figure 11 Training using different batch sizes
Table 3 Performance comparison of various batch sizes

Batch size    MSE        TIC (bias)    TIC (variance)
32            119.823    0.40          0.43
64            109.762    0.21          0.38
128           114.904    0.49          0.63
256           137.296    0.09          0.14
From a performance point of view (Table 3), the batch size of 64 has the lowest MSE, noticeably smaller than that of any other batch size. For a batch size of 128, both the bias and variance components of the TIC decomposition are high (0.49 and 0.63), implying poor performance. Although the TIC values for a batch size of 256 are the smallest (0.09 and 0.14), this batch size has the largest MSE (137.296); moreover, on the test set the batch size of 256 also shows some volatility. So the batch size should not be too large: with large batch sizes, the direction of the loss decay is unstable, which can lead to a local optimum rather than the global one. Overall, 64 is the most suitable batch size for our model.
6.3.4 Summary
Through the sensitivity analysis, we tuned three hyperparameters: the number of epochs, the optimizer, and the batch size. First, we varied the number of epochs from 0 to 100 using the initial model structure. The validation error hardly decreased until about epoch 50 and then converged sharply around epoch 60, suggesting that the model structure was not suitable; we therefore modified the model before tuning the optimizer. We then compared three optimizers: SGD, Adagrad, and Adam. Adam was the fastest to converge, i.e., the most efficient. Lastly, we considered the impact of various batch sizes; using the MSE value and the TIC decompositions (bias and variance), we concluded that a batch size of 64 is the most suitable for our model.
6.4 Performance diagnostics
In this section, we present two comparisons to test the performance of the modified model structure and the tuned hyperparameters. The first compares the proposed model with several baseline models (ARIMA, RNN, and Conv-LSTM using only pick-up demand). The second evaluates performance using K-fold cross-validation. The models are evaluated with two measures of effectiveness: MSE and the TIC decompositions (bias and variance).
6.4.1 Comparison between baselines and MSI-STNN
In this thesis, we use ARIMA, RNN, and Conv-LSTM with only pick-up demand data as benchmark models. Comparing our model against these three standard approaches allows us to assess the value of our formulation and of incorporating multi-source information and spatiotemporal attributes.

Table 4 lists the predictive performance of the proposed model and the three baselines on the testing set. The proposed MSI-STNN outperforms the benchmarks on both measures of predictive performance, which validates the importance of multi-source information and spatiotemporal attributes in demand forecasting.
Table 4 Performance comparison between the proposed model and baselines

Model                  MSE        TIC (bias)    TIC (variance)
ARIMA                  137.230    0.35          0.43
RNN                    116.304    0.38          0.45
Conv-LSTM              111.357    0.18          0.23
MSI-STNN (modified)    108.320    0.12          0.16
6.4.2 Evaluation using K-fold cross validation
Next, we adopt K-fold cross-validation to evaluate model performance. As mentioned earlier, the training set covers January 1st, 2017 to December 31st, 2017 and the testing set covers January 1st, 2018 to March 31st, 2018. If we split the dataset by seasons, only the first season of 2018 is available as a testing set, so we face a seasonality problem that can strongly affect the accuracy of the prediction outcome. To eliminate this influence on the demand forecasting, we use K-fold cross-validation.
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. It has a single parameter, k, referring to the number of groups that the data sample is split into; hence the name k-fold cross-validation. When a specific value of k is chosen, it may be substituted for k in referring to the procedure, e.g., k = 10 becomes 10-fold cross-validation.

In this thesis, we set k = 4. Three folds serve as the training set and the remaining one as the validation set (as shown in Fig. 12). The model is trained four times with different training sets, so that each fold is used once for validation.
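The 4-fold split can be sketched as follows; the toy dataset size is an illustrative assumption, whereas the real folds are the seasonal partitions shown in Fig. 12:

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train, validation) index arrays: each fold validates once
    while the remaining k-1 folds form the training set."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

n = 12  # toy dataset size standing in for the time-indexed demand records
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(n, k=4)):
    print(f"fold {fold}: train={len(train_idx)} samples, val={len(val_idx)} samples")
# Across the 4 runs, every sample appears in exactly one validation fold.
```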
Figure 12 K-fold cross-validation (K=4; red block: training set; blue block: validation set)

Table 5 shows the predictive performance of the originally proposed model without K-fold cross-validation and of the model with K-fold cross-validation. Accuracy does increase with K-fold cross-validation, but not by much, while the computational cost is high, since the model must run four times on such a huge dataset. Overall, K-fold cross-validation is useful for validating a model, but not for improving its performance.
Table 5 Performance comparison between the proposed model and the model with K-fold cross-validation

Model                                               MSE        TIC (bias)    TIC (variance)
MSI-STNN (modified)                                 113.076    0.16          0.25
MSI-STNN (modified) with K-fold cross-validation    105.209    0.19          0.22
6.5 Data visualization
In this last section, we present both spatial and temporal visualizations comparing the ground-truth pick-up demand and the predicted results on the test dataset.
6.5.1 Temporal visualization
The temporal characteristics are visualized for both the predicted and the real demand (Fig. 13). To show the results at high resolution, we plot only the demand, averaged over all taxi zones, for the first 1,000 timesteps (x1, x2, x3, ..., x1000). Each timestep is 15 minutes, so the results cover 10 days and 10 hours. The red line represents the ground-truth pick-up demand, and the green line is the corresponding prediction.

The prediction results match the actual data very well: the two curves follow the same trend, and the error is quite small. Even in some abnormal cases (e.g., the demand between the 300th and 400th timesteps), our model predicts accurately.
Figure 13 Temporal prediction outcome for the first 1,000 timesteps
6.5.2 Spatial visualization
Next, we visualize the data geographically. Fig. 14 shows the spatial characteristics of the prediction versus the ground truth: a heatmap of the average error (prediction minus ground truth) over all timesteps for the whole Manhattan area, where a deeper color implies a larger error (deep red for positive errors and deep blue for negative errors).

The accuracy of the demand forecast varies substantially across zones. For instance, the forecasts for taxi zones in Upper Manhattan have small errors, but the errors can be large for some zones around Central Park. Midtown Manhattan has heavy traffic compared to Upper Manhattan, which makes it challenging to forecast taxi demand accurately in those zones. Overall, the errors are small.
Figure 14 Average error distribution across zones for all the timesteps
Chapter 7
CONCLUSION
7.1 Contributions
In this thesis, we propose a novel deep learning approach, based on the fusion of data from multiple sources through a spatiotemporal neural network (MSI-STNN), to predict taxi pick-up demand over the next 15 minutes. The model uses historical demand information together with information on points of interest (POIs) in the zone of interest and on weather conditions.

We use taxi data from Manhattan, New York, as the source of pick-up and drop-off demand. The popularity and weather data are mined from a Yelp dataset and from the National Oceanic and Atmospheric Administration (NOAA), respectively. We quantify popularity by the number of reviews at a point of interest (POI); to the best of our knowledge, this is the first time such data have been used in taxi demand forecasting.
Our model (i.e. MSI-STNN) consists of four submodules, using convolutional long
short-term memory (Conv-LSTM), long short-term memory (LSTM), and convolutional
neural network (CNN) models. Two Conv-LSTMs capture the spatiotemporal
characteristics of pick-up and drop-off demand simultaneously, while the CNN and LSTM
models extract spatial and temporal information about zonal popularity and weather
conditions.
We evaluate the MSI-STNN performance through a case study. The performance metrics include the mean squared error (MSE) and two decomposition terms of Theil's inequality coefficient (TIC): bias and variance. Through a sensitivity analysis, three hyperparameters were fine-tuned: the number of epochs, the optimizer, and the batch size. Based on the results, the initial model structure was modified. For the optimizer, we evaluated three alternatives, the Adam, Adagrad and SGD algorithms; Adam was selected since it was the fastest to converge (i.e., the most efficient). The most suitable batch size was 64, resulting in relatively small values of the MSE and of the TIC decompositions (bias and variance). The model performance was validated by comparing its output with state-of-the-art time series and deep learning approaches, including ARIMA, RNN, and Conv-LSTM. The proposed MSI-STNN outperforms the benchmark algorithms. The results highlight the importance of multi-source information in demand prediction.
7.2 Future research
Future work can focus on exploring even more advanced deep learning architectures to
fuse multi-source information as well as constraints among zonal pick-ups and drop-offs.
In our case study, we used pick-up and drop-off demand from Manhattan taxi data, which
is a perfect dataset without any missing values. In the future, we need to test the robustness
of the model using other datasets that may have missing values. We only consider three
types of hyperparameters. But there are some other hyperparameters that can be tuned (e.g.
loss function).
REFERENCES
[1] J.H. Cochrane. Time Series for Macroeconomics and Finance. 1997.
[2] K.W. Hipel and A.I. McLeod. Time Series Modelling of Water Resources and
Environmental Systems. 1994.
[3] C.F. Lee, J.C. Lee, and A.C. Lee. Statistics for Business and Financial Economics.
World Scientific Publishing Co. Pte. Ltd, 1999.
[4] P.J. Harrison and C.F. Stevens. Bayesian Forecasting. Journal of the Royal Statistical
Society. Series B (Methodological), Vol. 38, No. 3, 1976, pp. 205-247.
[5] M.S. Ahmed and A.R. Cook. Analysis of freeway traffic time-series data by using Box-Jenkins techniques, 1979.
[6] I. Okutani and Y.J. Stephanedes. Dynamic prediction of traffic volume through Kalman
filtering theory.
[7] J.M. Kihoro, R.O. Otieno and C. Wafula. Seasonal time series forecasting: a comparative study of ARIMA and ANN models. African Journal of Science and Technology (AJST), Science and Engineering Series, 2004.
[8] P.M. Yelland. Bayesian forecasting of parts demand. International Journal of
Forecasting, 2010, pp. 374-396.
[9] B.M. Williams and L.A. Hoel. Modeling and forecasting vehicular traffic flow as a
seasonal ARIMA process: theoretical basis and empirical results. J. Transp. Eng. Vol. 129,
No. 6, 2003, pp. 664–672.
[10] J. Guo, W. Huang and B.M. Williams. Adaptive Kalman filter approach for stochastic
short-term traffic flow rate prediction and uncertainty quantification. Transportation
Research Part C: Emerging Technologies. Vol. 43, No. 1, 2014, pp. 50–64.
[11] E. Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
[12] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning.
[13] W.S. McCulluoch and W.H. Pitts. A logical calculus of the ideas immanent in nervous
activity. The Bulletin of Mathematical Biophysics, Vol. 5, 1943, pp. 115-133.
[14] L. Breiman. Random Forest. Machine Learning, Vol. 45, No. 1, 2001, pp. 5-32.
[15] L. Zhang, Q. Liu, W. Yang, N. Wei and D. Dong. An Improved K-nearest Neighbor
Model for short-term Traffic Flow Prediction. Procedia-Social and Behavioral Sciences,
Vol. 9, No. 6, 2013, pp. 653-662.
[16] E.I. Vlahogianni, M.G. Karlaftis and J.C. Golias. Optimized and meta-optimized
neural networks for short-term traffic flow prediction: a genetic approach. Transportation
Research Part C, Vol. 13, No. 3, 2005, pp. 211–234.
[17] J. Kamruzzaman, R. Begg, and R. Sarker. Artificial Neural Networks in Finance and
Manufacturing. Idea Group Publishing, 2006.
[18] G. Zhang, B.E. Patuwo, and M.Y. Hu. Forecasting with artificial neural networks: The
state of the art. International Journal of Forecasting,1998.
[19] A. Azadeh and Z.S. Faiz. A meta-heuristic framework for forecasting household
electricity consumption. Applied Soft Computing, 2001.
[20] A.C. de Pina and G. Zaverucha. Combining attributes to improve the performance of
naive bayes for regression. IEEE World Congress on Computational Intelligence, 2008.
[21] A. Al-Smadi and D. M. Wilkes. On estimating arma model orders. IEEE International
Symposium on Circuits and Systems, 1996.
[22] X. Li, L. Ding, M. Shao, G. Xu, and J. Li. A novel air-conditioning load prediction
based on arima and bpnn model. Asia-Pacific Conference on Information Processing,
2009.
[23] S. Hochreiter and J. Schmidhuber. Long Short-term Memory. Neural Computation,
Vol. 9, No. 8, 1997, pp. 1735-1780.
[24] J. Chung, C. Gulcehre, K.H. Cho and Y. Bengio. Empirical Evaluation of Gated
Recurrent Neural Networks on Sequence Modeling, 2014.
[25] X. Ma, Z. Tao, Y. Wang, H. Yu and Y. Wang. Long Short-term Memory Neural
Network for Traffic Speed Prediction using Remote Microwave Sensor Data.
Transportation Research Part C: Emerging Technologies, Vol. 54, 2015, pp. 187-197.
[26] H. Yu, Z. Wu, S. Wang, Y. Wang and X. Ma. Spatiotemporal Recurrent Convolutional
Networks for Traffic Prediction in Transportation Networks. Sensors, Vol. 17, No. 7, 2017,
pp. 1501.
[27] N.G. Polson and V.O. Sokolov. Deep Learning for Short-term Traffic Flow Prediction.
Transportation Research Part C: Emerging Technologies, Vol. 79, 2017, pp. 1-17.
[28] X. Shi, Z. Chen, H. Wang and D.Y. Yeung. Convolutional LSTM Network: A
Machine Learning Approach for Precipitation Nowcasting, 2015.
[29] X. Ma, Z. Dai, Z. He, J. Ma, Y. Wang and Y. Wang. Learning Traffic as Images: A
Deep Convolutional Neural Network for Large-Scale Transportation Network Speed
Prediction. Sensors, Vol. 17, No. 4, 2017, pp. 818.
[30] J. Ke, H. Zheng, H. Yang and X. Chen. Short-term Forecasting of Passenger Demand
Under On-demand Ride Services: A Spatio-temporal Deep Learning Approach.
Transportation Research Part C, Vol. 85, 2017, pp. 591-608.
[31] J. Bao, P. Liu and S.V. Ukkusuri. A Spatiotemporal Deep Learning Approach for
Citywide Short-term Crash Risk Prediction with Multi-source Data. Accident Analysis and
Prevention, Vol. 122, 2019, pp. 239-254.
[32] B. Cule, B. Goethals, S. Tassenoy, and S. Verboven. Mining train delays. Advances
in Intelligent Data Analysis X, ser. LNCS vol. 7014, pages 113-124, 2011.
[33] D.M. Kline. Methods for multi-step time series forecasting with neural networks.
Information Science Publishing, pages 226-250, 2004.
[34] X. Li, G. Pan, Z. Wu, G. Qi, S. Li, D. Zhang, W. Zhang, and Z. Wang. Prediction of
urban human mobility using large-scale taxi traces and its applications. Frontiers of
Computer Science in China, pages 111-121, 2012.
[35] M.C. Gonzalez, C.A. Hidalgo, and A.-L. Barabasi. Understanding individual human
mobility patterns. Nature, pages 779-782, 2008.
[36] J. Gama and P. Rodrigues. Stream-based electricity load forecast. Knowledge
Discovery in Databases: PKDD, pages 446-453, 2007.
[37] B. Williams and L. Hoel. Modeling and forecasting vehicular traffic flow as a seasonal
arima process: Theoretical basis and empirical results. Journal of Transportation
Engineering, pages 664-672, 2003.
[38] K. Wong, S. Wong, M. Bell, and H. Yang. Modeling the bilateral micro-searching behavior for urban taxi services using the absorbing Markov chain approach. Journal of Advanced Transportation, Vol. 39, No. 1, pages 81-104, 2005.
[39] L. Moreira-Matias, J. Gama, M. Ferreira, and L. Damas. A predictive model for the
passenger demand on a taxi network. 15th International IEEE Conference on Intelligent
Transportation Systems (ITSC), pages 1014-1019, 2012.
[40] L. Liu, C. Andris, A. Biderman, and C. Ratti. Uncovering taxi drivers' mobility intelligence through their traces. IEEE Pervasive Computing, pages 1-17, 2009.
[41] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang. T-drive: driving
directions based on taxi trajectories. Proceedings of the 18th SIGSPATIAL International
Conference on Advances in Geographic Information Systems, ACM, pages 99-108, 2010.
[42] B. Li, D. Zhang, L. Sun, C. Chen, G. Qi, S. Li, and Q. Yang. Hunting or waiting? Discovering passenger-finding strategies from a large-scale real-world taxi dataset. 2011 IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOM Workshops), pages 63-68, 2011.
[43] L. Moreira-Matias, J. Gama, M. Ferreira, and L. Damas. A predictive model for the
passenger demand on a taxi network. 15th International IEEE Conference on Intelligent
Transportation Systems (ITSC), pages 1014-1019, 2012.
[44] J. Cryer and K. Chan. Time Series Analysis with Applications. R. Springer, 2008.
[45] A. Ihler, J. Hutchins, and P. Smyth. Adaptive event detection with time-varying
poisson processes. Proceedings of the 12th ACM SIGKDD international conference on
Knowledge discovery and data mining, ACM, pages 207-216, 2006.
[46] T. Toledo and H.N. Koutsopoulos. Statistical validation of traffic simulation models.
Transportation Research Record: Journal of the Transportation Research Board, pages
142-150, 2004.
[47] V.S. Bawa. Basic architecture of RNN and LSTM.
https://pydeeplearning.weebly.com/blog/basic-architecture-of-rnn-and-lstm
[48] C. Olah. Understanding LSTM networks. https://colah.github.io/posts/2015-08-
Understanding-LSTMs/