Analysis and Prediction of Flight Prices using historical pricing data
1st Swiss Hadoop User Group meeting – May 14, 2012
Jérémie Miserez - [email protected]
2012-05-14

14.05.12 Analysis and Prediction of Flight Prices using Historical Pricing Data with Hadoop (Jérémie Miserez, ETH Zürich)

Overview

- Project setup
- Goals
- Exploratory data analysis (Hadoop)
- Classification & prediction methods
- Processing pipeline (Hadoop)
- Results

This project was done as part of my Bachelor’s thesis at the Systems Group, ETH Zürich, in collaboration with Amadeus IT Group SA.

Project setup

Airline tickets can be bought up to ~1 year in advance. Prices change from day to day.

Amadeus CRS is the largest global distribution system in the travel/tourism industry:
- sells tickets for 435 airlines (also hotels, cruises, etc.)
- processes ~850 million billable transactions per year

Amadeus provided us with a dataset containing buyable tickets for each day from May 2008 – Jan 2011.

Goals

1. Construct and train a general classifier so that it can distinguish between expensive and cheap tickets.

2. Use this classifier to predict the prices of future tickets.

3. Determine which factors have the greatest impact on price by analyzing the trained classifier.

But first: Need to understand dataset!

Exploratory data analysis

Extent of the dataset:
- 27.2 billion records
- 132.2 GiB (uncompressed)
- 63 departure airports, 428 destinations, 4387 routes, 117 airlines

Exploratory data analysis

The majority of activity is concentrated in Europe:

Exploratory data analysis

Lots of fields:
- “Buy” date: When was this price current?
- “Fly” date: When does the flight leave?
- …
- Price & currency
- …
- Cabin class: Economy/Business/First (98% economy tickets)
- Booking class: A - Z
- …
- Airline: The airline selling the ticket.
- …

Not a time series: tickets are not linked over time.

Exploratory data analysis

Visualizing small subsets helps to understand the data.

Many simple Hadoop jobs were used to preprocess the data, followed by multiple visualizations in Matlab.

Can we see some patterns already?

Exploratory data analysis

For ZRH-BKK, plot the prices of the cheapest tickets available every day:

[Figure: cheapest ZRH-BKK ticket price by fly date and buy date; price scale from 600 EUR to 2400 EUR; axis labels include July and December.]
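A grid like this can be produced by a simple min-aggregation, which maps naturally onto a MapReduce job (key: buy/fly date pair, reduce: minimum). The following is a minimal in-memory sketch; the record layout `(route, buy_date, fly_date, price_eur)` and the function name are hypothetical simplifications, not the dataset's actual schema, and the sample values are made up.

```python
from collections import defaultdict

def cheapest_per_day(records, route):
    """Return {(buy_date, fly_date): minimum price} for one route.

    `records` is an iterable of (route, buy_date, fly_date, price_eur)
    tuples; this layout is assumed for illustration only.
    """
    cheapest = defaultdict(lambda: float("inf"))
    for rec_route, buy_date, fly_date, price_eur in records:
        if rec_route != route:
            continue  # "map" step: keep only the route of interest
        key = (buy_date, fly_date)
        # "reduce" step: keep the minimum price seen for this key
        cheapest[key] = min(cheapest[key], price_eur)
    return dict(cheapest)

# Made-up sample records, not taken from the real dataset.
sample = [
    ("ZRH-BKK", "2010-01-01", "2010-07-01", 950.0),
    ("ZRH-BKK", "2010-01-01", "2010-07-01", 720.0),
    ("ZRH-BKK", "2010-01-02", "2010-07-01", 810.0),
    ("ZRH-JFK", "2010-01-01", "2010-07-01", 400.0),
]
grid = cheapest_per_day(sample, "ZRH-BKK")
```

Each entry of `grid` corresponds to one cell of the plot; the plotting itself is left out here.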

Classification & Prediction methods

Implemented two different classifiers:
- Support vector machine (SVM)
- L1-regularized linear regression

Both are convex minimization problems that can be solved online with the stochastic gradient descent (SGD) method.
- Online algorithm: memory usage is constant and does not depend on the size of the dataset.
- “Stochastic”: the order of the training points is selected at random from the dataset.

SGD can be parallelized (parallelized SGD)* with almost no overhead, and is very suitable for use with MapReduce.

* M. Zinkevich, M. Weimer, A. Smola, and L. Li. “Parallelized stochastic gradient descent”, 24th Annual Conference on Neural Information Processing Systems, 2010.
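To illustrate why an online method needs only constant memory, here is a minimal sketch of SGD for L1-regularized least squares using a plain subgradient step. The step size `eta`, the regularization weight `lam`, the function name, and the toy data are all assumptions; the slides do not give the actual update rule or hyperparameters.

```python
import random

def sgd_l1_regression(stream, dim, eta=0.01, lam=0.001):
    """Subgradient SGD for 0.5*(w.x - y)^2 + lam*||w||_1.

    `stream` yields (x, y) pairs one at a time, so only the weight
    vector `w` is kept in memory, regardless of the dataset size.
    """
    w = [0.0] * dim
    for x, y in stream:
        residual = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j in range(dim):
            sign = (w[j] > 0) - (w[j] < 0)  # subgradient of |w_j|
            w[j] -= eta * (residual * x[j] + lam * sign)
    return w

# Made-up toy data following y = 2*x0 - 1*x1, generated in random order.
rng = random.Random(0)

def toy_stream(n):
    for _ in range(n):
        x = [rng.gauss(0, 1), rng.gauss(0, 1)]
        yield x, 2.0 * x[0] - 1.0 * x[1]

w = sgd_l1_regression(toy_stream(5000), dim=2)
```

Because the points arrive through a generator, the sketch never holds more than one training point at a time.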

Classification & Prediction methods

SVM: binary linear classifier

Goal: Find the maximum-margin hyperplane that divides the points with label “+1” from those with label “-1”.

After training: the hyperplane parameters are fixed, and the label for a data point is given by the side of the hyperplane on which it lies.

Training:
- Generate a training label for the i-th data point.
- Choose the hyperplane parameters so that the margin is maximal and the training data is still correctly classified.
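In standard textbook notation, a linear SVM of this kind can be written as follows. This is an assumption-laden sketch: the symbols $\mathbf{w}$, $\mathbf{x}_i$, $y_i$ and the omission of a bias term are not taken from the slide.

```latex
\hat{y}(\mathbf{x}) = \operatorname{sign}\left(\mathbf{w}^\top \mathbf{x}\right),
\qquad
\min_{\mathbf{w}} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
\quad \text{s.t.} \quad
y_i \, \mathbf{w}^\top \mathbf{x}_i \ge 1 \;\; \text{for all } i .
```

Under this scaling the distance from the hyperplane to the closest training point is $1/\lVert \mathbf{w} \rVert$, so minimizing $\lVert \mathbf{w} \rVert$ maximizes the margin.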

Classification & Prediction methods

Implementation uses:
- Hinge loss function: takes “outliers” into account.
- Regularization parameter: bounds the length of the hyperplane's weight vector, which keeps the margin large and increases generalization.
- Preprocessing: data is normalized to zero mean and unit variance.

For training points: the margin has a lower bound.
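As an illustration of SGD on a regularized hinge loss, the sketch below follows the well-known Pegasos scheme (step size 1/(lam*t), followed by a projection that bounds the length of the weight vector). This is a swapped-in standard variant, not necessarily the update rule used in the project; `lam`, the function names, and the toy data are assumptions.

```python
import math
import random

def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - y * w.x) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * sum(wj * xj for wj, xj in zip(w, x)))

def pegasos_sgd(stream, dim, lam=0.1):
    """Pegasos-style SGD on lam/2*||w||^2 + hinge loss."""
    w = [0.0] * dim
    radius = 1.0 / math.sqrt(lam)  # upper bound on ||w||
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / (lam * t)
        violated = hinge_loss(w, x, y) > 0.0  # checked before the update
        w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
        if violated:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm > radius:  # project back onto the ball of radius 1/sqrt(lam)
            w = [wj * radius / norm for wj in w]
    return w

# Made-up toy data: the label is the sign of the first coordinate.
rng = random.Random(0)

def toy_points(n):
    for _ in range(n):
        y = rng.choice([-1, 1])
        yield [y * (1.0 + abs(rng.gauss(0, 1))), rng.gauss(0, 1)], y

w = pegasos_sgd(toy_points(5000), dim=2)
test_points = list(toy_points(500))
accuracy = sum(
    (w[0] * x[0] + w[1] * x[1] > 0) == (y > 0) for x, y in test_points
) / len(test_points)
```

The projection step makes the link between the regularization parameter and the bound on the weight vector explicit: a larger `lam` gives a smaller radius.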

Hadoop: Preprocessing

Generate training labels (y) from the dataset:
- Convert currencies using historical exchange rates.
- For each route r, calculate the arithmetic mean (and standard deviation) of the price over all tickets.
- Assign labels:
  Label +: “Above mean price for this route”
  Label -: “Below mean price for this route”
- Only store mean/std-dev; do not actually store labels in the HDFS.
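The labeling step can be sketched as follows. Only a count, a sum, and a sum of squares are accumulated per route, so partial results from several workers could be merged by simple addition, and labels are derived on the fly from the stored mean. The tuple layout, the function names, the treatment of a price exactly equal to the mean, and the sample values are assumptions; prices are taken to be already converted to one currency.

```python
import math
from collections import defaultdict

def route_stats(tickets):
    """Return {route: (mean, std)} from an iterable of (route, price)."""
    acc = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum, sum of squares
    for route, price in tickets:
        a = acc[route]
        a[0] += 1
        a[1] += price
        a[2] += price * price
    stats = {}
    for route, (n, total, total_sq) in acc.items():
        mean = total / n
        variance = max(total_sq / n - mean * mean, 0.0)  # population variance
        stats[route] = (mean, math.sqrt(variance))
    return stats

def label(route, price, stats):
    """+1 if the price is above the route mean, else -1.

    A price exactly equal to the mean gets -1 here (an arbitrary choice).
    """
    mean, _ = stats[route]
    return 1 if price > mean else -1

# Made-up sample tickets, not taken from the real dataset.
tickets = [
    ("ZRH-BKK", 600.0), ("ZRH-BKK", 1000.0),
    ("ZRH-LHR", 100.0), ("ZRH-LHR", 300.0),
]
stats = route_stats(tickets)
labels = [label(route, price, stats) for route, price in tickets]
```

Storing only `stats` keeps the output tiny compared to writing one label per record.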

Hadoop: Preprocessing

Extract features from plaintext records (x):
- Each plaintext record is transformed into a 930-dimensional vector.
- Each dimension contains a numerical value corresponding to a feature such as:
  Number of days between “Buy” and “Fly” dates
  Day of week (for all dates)
  Is the day on a weekend? (for all dates)
  Is the currency CHF?
  etc.
- Each dimension is normalized to zero mean and unit variance (per route r).
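A few of the listed features can be sketched like this. The one-hot encoding of the day of week, the argument names, and the resulting 18 dimensions are assumptions made for illustration; the real vectors have 930 dimensions whose exact layout is not given in the slides.

```python
from datetime import date

def extract_features(buy, fly, currency):
    """Turn one record into a small numeric feature vector."""
    feats = [float((fly - buy).days)]  # days between "Buy" and "Fly"
    for d in (buy, fly):
        dow = d.weekday()  # Monday=0 ... Sunday=6
        feats.extend(1.0 if dow == k else 0.0 for k in range(7))
        feats.append(1.0 if dow >= 5 else 0.0)  # weekend flag
    feats.append(1.0 if currency == "CHF" else 0.0)
    return feats

def standardize(vectors):
    """Scale each dimension to zero mean and unit variance.

    Constant dimensions (zero variance) are set to 0.0.
    """
    n, dim = len(vectors), len(vectors[0])
    out = [list(v) for v in vectors]
    for j in range(dim):
        col = [v[j] for v in vectors]
        mean = sum(col) / n
        std = (sum((c - mean) ** 2 for c in col) / n) ** 0.5
        for i in range(n):
            out[i][j] = (col[i] - mean) / std if std > 0 else 0.0
    return out

x = extract_features(date(2010, 1, 1), date(2010, 7, 1), "CHF")
scaled = standardize([[1.0, 5.0], [3.0, 5.0]])
```

To normalize per route, `standardize` would be applied separately to the vectors of each route.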

Hadoop: Processing pipeline

- Shuffle the data: (P)SGD demands random selection of data points.
- Partition the data into n (=1200) chunks.
- Train using PSGD:
  Parallel training on k (=40) chunks
  Average hyperplane coefficients after all 1200 chunks have been processed (= after 30 iterations).

We can get intermediate results by calculating the accuracy every time 40 chunks have been processed.
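The pipeline can be sketched as below. This is one possible reading of the slide, not the project's actual implementation: each of the k workers keeps its own weight vector across rounds, and the vectors are averaged at the end. The workers run sequentially here, whereas in Hadoop each would be a separate task. The per-chunk update (plain SGD on a regularized hinge loss), `eta`, `lam`, all names, and the toy data are assumptions.

```python
import random

def sgd_chunk(w, chunk, eta=0.01, lam=0.01):
    """Plain SGD on the L2-regularized hinge loss over one chunk."""
    for x, y in chunk:
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        w = [(1.0 - eta * lam) * wj for wj in w]
        if margin < 1.0:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

def psgd(data, dim, n_chunks, k_workers, seed=0):
    """Shuffle, partition into chunks, train k at a time, then average."""
    data = list(data)
    random.Random(seed).shuffle(data)
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    workers = [[0.0] * dim for _ in range(k_workers)]
    for start in range(0, n_chunks, k_workers):  # one round per k chunks
        batch = chunks[start:start + k_workers]
        for j, chunk in enumerate(batch):  # independent, parallelizable
            workers[j] = sgd_chunk(workers[j], chunk)
    # Average the hyperplane coefficients of all workers.
    return [sum(w[d] for w in workers) / k_workers for d in range(dim)]

# Made-up toy data: the label is the sign of the first coordinate.
rng = random.Random(1)
toy = []
for _ in range(1200):
    y = rng.choice([-1, 1])
    toy.append(([y * (1.0 + abs(rng.gauss(0, 1))), rng.gauss(0, 1)], y))

w = psgd(toy, dim=2, n_chunks=12, k_workers=4)  # 12 / 4 = 3 rounds
accuracy = sum(
    (w[0] * x[0] + w[1] * x[1] > 0) == (y > 0) for x, y in toy
) / len(toy)
```

With n = 1200 and k = 40 the loop in `psgd` would run for 30 rounds, matching the numbers on the slide; averaging the workers after any round gives an intermediate result.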

Extensions done to the basic algorithms:

Hierarchical classification:
- Train 7 classifiers in parallel
- Increases runtime by a factor of 3.

Per airline classification:
- Train 1+21 classifiers in parallel
- Increases runtime by a factor of 2.

General classifier
1 – Airline A classifier (21%)
2 – Airline B classifier (9%)
3 – Airline C classifier (7%)
4 – Airline D classifier (6%)
…
21 – “Other” airlines (15.4%)

Results: Overall accuracy

Dataset: 10% subsample of all records (class economy)

Results: Overall accuracy

Dataset: All records ZRH -> * (economy)

Results: Overall accuracy

Dataset: All records ZRH -> BKK (economy)

Results: Analyzing a single airline X

SVM classifier 0, for airline X, dataset 10% full subsample

Results: Analyzing a single airline X

SVM classifier 0, for airline X, dataset 10% full subsample

Questions!
