Automated Repricing in Comparison Shopping Agents: Price Prediction and Pricing Strategy Extraction Using Decision Trees

Master's Thesis
Manuel Zahn | 1397961
Wirtschaftsinformatik (Information Systems)

Information Systems Group (Fachgebiet Wirtschaftsinformatik)
Department of Law and Economics

Manuel Zahn
Matriculation number: 1397961
Degree program: Master Wirtschaftsinformatik (Information Systems)

Master's Thesis
Topic: "Automated Repricing in Comparison Shopping Agents: Price Prediction and Pricing Strategy Extraction Using Decision Trees"

Submitted: 22.10.2016

Supervisor: Dr. Irina Heimbach

Prof. Dr. Oliver Hinz
Information Systems Group
Department of Law and Economics
Technische Universität Darmstadt
Hochschulstraße 1
64289 Darmstadt

Prof. Dr. Johannes Fürnkranz
Knowledge Engineering Group
Department of Computer Science
Technische Universität Darmstadt
Hochschulstraße 10
64289 Darmstadt

In cooperation with:
Patagona GmbH
Poststraße 9
64293 Darmstadt
Declaration regarding the Master's Thesis

I hereby declare that I have written the present Master's thesis without the help of third parties and using only the cited sources and aids. All passages taken from sources are marked as such. This thesis has not previously been submitted in the same or a similar form to any examination authority.

Darmstadt, 22.10.2016
(Manuel Zahn)
Abstract
Prices on comparison shopping agents (CSAs) often emerge from complex automated rules, such as alignment with competitor prices. Drawing conclusions from prices to their underlying pricing strategies is a major challenge. From an economic point of view, such pricing insights are crucial, since they can be used for price prediction, which enables better enforcement of one's own pricing strategies. Following a strict divide-and-conquer concept, this thesis analyzes the feasibility of automated pricing strategy extraction and price prediction.
In a first step, 21.6 million offers are crawled from a German CSA. This recent and unique dataset covers 100 products over a time span of 80 days. Subsequently, a fine-grained market analysis has been conducted for multiple dimensions1 and multiple research fields. The key findings comprise daily minimum price change rates and reseller price change rates of about every third day. Based on product minimum price change rates of up to every second hour, the market analysis has shown that the dataset provides sufficient price dynamics for the purpose of gaining pricing insights.
First, the problem has been simplified to the question: Is it possible to detect a price series' origin by partitioning into manual and automated2 creation? This has been tested via supervised classification. Experts have classified the price series, and decision tree algorithms are then used to classify the price series based on a broad feature set. A comprehensive evaluation with 10-fold cross validation indicates that automated repricing detection is possible with high accuracy.3 This result lays the foundation for the subsequent tasks.
The price prediction is performed with two different approaches. On the one hand, there are time-series-based predictors which rely on the pure reseller price series, using methods like support vector regression and ARIMA models. On the other hand, there are feature-based algorithms which use a combination of decision and regression trees as key building blocks. A time series cross validation with up to 80 folds has been conducted. The feature-based algorithms achieve promising forecasting results for different types of price changes, with up to 11% fewer prediction errors than a reference 'no price change' predictor.
The pricing strategy extraction is based on a combined heuristic approach built on two classes of features and well-founded methods. For example, causality measures are used to identify competitor interlink strategies, and motif discovery methods are used for extracting time-dependent strategies. Based on 6,632 reseller price series, six different types of strategies are extracted.
The results of this thesis facilitate a deeper understanding of pricing mechanisms on CSAs and enable online retailers and repricing providers to be a step ahead of their competitors.
1 The market analysis dimensions encompass aggregation level, time, price and availability.
2 In terms of 'by an automated repricing algorithm'.
3 Reaching up to 97.11% area under the receiver operator characteristic (ROC) curve for the testing sets.
Contents
List of Figures 2
List of Tables 3
List of Algorithms 4
List of Abbreviations 5
1 Introduction 6
  1.1 Motivation 6
  1.2 Research Problem 6
  1.3 Objective 8
  1.4 Structure 9
2 Literature Review 10
  2.1 Dynamic Pricing 10
  2.2 Comparison Shopping Agents 12
    2.2.1 Price Dispersion 13
    2.2.2 Customer Characteristics 16
    2.2.3 Reseller Characteristics 16
  2.3 Dynamic Pricing Strategies 17
  2.4 Pricing Strategy Extraction 20
  2.5 Price Prediction 21
3 Market Review of Repricing Providers 24
  3.1 Repricing Providers in Germany 25
  3.2 Repricing Providers in USA 25
  3.3 Discussion 26
4 Competitive Market Analysis 29
  4.1 Approach and Settings 29
  4.2 Implementation 31
  4.3 Results 32
    4.3.1 1D: All Offers 32
    4.3.2 1D: Product Categories 35
    4.3.3 1D: Products 36
    4.3.4 1D: Resellers 37
  4.4 Discussion 38
5 Analysis 39
  5.1 Automated Repricing Classification 40
    5.1.1 Concepts 40
    5.1.2 Approach 43
    5.1.3 Evaluation 46
    5.1.4 Discussion 48
  5.2 Price Prediction 50
    5.2.1 Concepts 50
    5.2.2 Approach 52
    5.2.3 Evaluation 58
    5.2.4 Discussion 63
  5.3 Pricing Strategy Extraction 66
    5.3.1 Concepts 66
    5.3.2 Approach 66
    5.3.3 Evaluation 69
    5.3.4 Discussion 74
6 Conclusion 76
Bibliography 78
A Product Selection Process 84
B Classification Feature Selection Algorithms 86
C Classification Classifiers Grid Search Configuration 88
D Evaluation of Different Balancing Schemes 89
E Detailed Classification Results 90
F Large Decision Tree Examples 92
G Prediction Classifier Grid Search Configuration 93
H Start Hour Prediction Comparison 94
I Detailed Minimum Price Prediction Results 95
J Detailed Reseller Price Prediction Results 97
List of Figures
1 The environment of a repricing provider (from an e-commerce perspective). 7
2 A typical offer section from a CSA (idealo.de). 12
3 The market analysis concept. 31
4 The offer origin on idealo.de. 34
5 Analysis of price trends on a day of the week base. 34
6 Delta analysis of all offers by different time horizons. 36
7 Product categories under consideration of different deltas. 37
8 Product with GTIN 8628264 with two high frequency repricing resellers. 37
9 Analysis overview. 39
10 A simple decision tree example. 40
11 5-fold cross validation partitioning scheme. 42
12 The evaluation scheme of the automated repricing classification. 45
13 Automated repricing ratio of categories. 47
14 A generated C4.5 tree. 48
15 Classification prediction results. 48
16 Transition of the classification prediction results from theory to practice. 49
17 5-fold time series cross validation partitioning scheme. 50
18 A simple regression tree example. 51
19 The price delta prediction concept. 54
20 The evaluation scheme of the decision/regression tree price predictor. 57
21 Minimum price prediction results for simple delta. 60
22 Minimum price prediction results for direction delta. 61
23 Minimum price prediction results for absolute delta. 61
24 A grown M5 tree for minimum price prediction. 62
25 Reseller price delta prediction results of the car category. 63
26 A grown M5 tree for reseller price prediction. 64
27 The pricing strategy extraction pipeline. 67
28 Extracted pricing strategies. 70
29 Interlink between mein-reifen-outlet.de and giga-reifen.de. 71
30 Interlink between acom-pc.de and future-x.de. 71
31 Night time frame strategy. 72
32 Daily assortment repricing strategy. 73
33 The target position strategy in action (GTIN 3439602810019). 74
34 Different balancing schemes with REP trees. 89
35 A generated C4.5 tree of medium size. 92
36 A generated C4.5 tree of large size. 92
37 Start hours and RMSE stability. 94
List of Tables
1 Baseline DP directions based on Boer (2015) and Gönsch et al. (2013, p. 511). 10
2 Price dispersion explanation approaches based on Grover et al. (2006, pp. 300-302). 15
3 Summary of observed strategies. 24
4 The underlying strategy parameters. 25
5 Repricing providers in Germany. 26
6 Repricing providers in the USA. 28
7 Examples of deltas and delta ratios. 29
8 The composition of a product quintuple. 29
9 The 100 selected products. 30
10 The different offer analyzers. 33
11 Minimum price trends of selected categories. 35
12 Overview of classification features. 44
13 Overview of time series prediction methods of R's forecast package. 53
14 Overview of prediction features. 56
15 Price persistence ratios of the decision tree approaches (predictive car category with all resellers). 64
16 Correctly identified target position strategies. 73
17 Top 40 (10/25/2015 - 13 PM) of Billiger.de. 84
18 Top 40 mapped categories of Billiger.de. 85
19 Product category selection. 85
20 Classification grid search parameters. 88
21 Base results of the automated repricing classification. 90
22 Detailed results of the automated repricing classification. 90
23 Preferred features of the automated repricing classifiers. 91
24 Price prediction grid search parameters. 93
25 Minimum price prediction results for daily simple price deltas. 95
26 Minimum price prediction results for daily direction price deltas. 96
27 Minimum price prediction results for daily absolute price deltas. 96
28 Reseller price prediction results for the car product category. 97
List of Algorithms
1 Random forest algorithm. 41
2 Interlink strategy extractor scheme. 69
3 Greedy feature selection algorithm. 86
4 Binary feature selection algorithm. 87
List of Abbreviations
API Application Programming Interface
AR Automated Repricing
ARIMA Autoregressive Integrated Moving Average
BATS Box-Cox transform, ARMA errors, Trend, and Seasonal components
CSA Comparison Shopping Agent
DSHW Double-Seasonal Holt Winters
DP Dynamic Pricing
ETS Exponential Smoothing
GTIN Global Trade Item Number
HW Holt Winters
KPI Key Performance Indicator
LR Linear Regression
MAE Mean Absolute Error
MLP Multilayer Perceptron
MR Manual Repricing
NNETAR Neural Network Auto Regression
PPR Price Persistence Ratio
RMSE Root Mean Squared Error
ROC Receiver Operator Characteristic
SMOTE Synthetic Minority Over-sampling Technique
STL Seasonal and Trend decomposition using Loess
SVR Support Vector Regression
TBATS Trigonometric BATS
UTC Coordinated Universal Time
1 Introduction
Resellers on comparison shopping agents often use sophisticated dynamic repricing policies for
setting prices. Drawing conclusions from prices to their underlying pricing strategies is a major
challenge. In this thesis, the market dynamics of a comparison shopping agent are analyzed and
machine learning methods are applied in order to reverse engineer pricing knowledge. Pricing
insights can have a great impact from an online retailer's perspective, since future prices can be predicted, which in turn is the key to Pareto-optimizing one's own prices.
1.1 Motivation
Dynamic pricing describes price analysis and price adjustments of products or services in a
market environment, where prices can easily and frequently be adjusted (Boer 2015a, p. 2).
That characterizes the environment in which online resellers operate. Often there is a fine pricing line between minimizing lost sales and maximizing margins. Advanced pricing strategies have to be implemented to reach the online reseller's goals. The application of dynamic pricing software for automated price determination, so-called repricing tools, is a key factor for maintaining margins in electronic commerce. It can be assumed that such price intelligence is emerging and widespread, with adoption of up to 25% among online resellers (Baird and Rosenblum 2013, p. 21 (US); Skorupa 2014, p. 2 (worldwide)).
However, repricing tools only consider the current pricing situation. Developing and applying pricing strategies is a crucial task for an online reseller, and the manual implementation of pricing strategies is arduous, time consuming and thus ineffective. For this reason, repricing providers arose. They offer automated repricing services relying on a broad spectrum of pricing strategies and corresponding parameters. Researchers have noted that repricing providers have to learn, adapt to and anticipate changes in the dynamic e-commerce environment (Kephart et al. 2000, p. 749). So what do those artificial pricing strategies look like, and is it possible to deduce them?
The thesis is fostered by cooperation with a leading German repricing provider which grants
access to its crawling framework. The crawled dataset consists of 21.6 million offers for 100
products of a single German comparison shopping agent (CSA). Building on top of a fine-grained
market analysis, this master thesis analyses historic reseller price series on a CSA with machine
learning methods. Primarily, state-of-the-art decision tree approaches are applied. The intended
objective is to derive pricing strategies of resellers on product level by using exploited pricing
intelligence. If the pricing strategies of the competitors are known, the future pricing structure
can be predicted. Vice versa, new repricing algorithms can be developed in order to optimize
prices from an online retailer’s perspective. Further, this thesis will show that pricing strategies
do not necessarily have to be known in order to make good forecasts of price changes.
1.2 Research Problem
This subsection describes the environment of a repricing provider which is shown in figure 1.
Figure 1: The environment of a repricing provider (from an e-commerce perspective).

On the one hand, there are customers, who may have a basket of desired goods at a specific point in time. These customers pursue different goals, such as buying their products for the
lowest price, buying only from trustworthy shops, buying only products with a fast delivery,
or a combination of the mentioned intentions. These goals are decomposed into parameters. The
customers may use a CSA.
A CSA acts as an intermediary and periodically aggregates pricing information from multiple
online retailers. This information is provided as a price overview at product level. A CSA typically collects further offer information like delivery time and retailer rankings.
On the other hand, there are online retailers who may decide to spend money for being listed on
the CSAs. An online retailer typically performs price adjustments in order to achieve his goals.
This process is further called pareto price optimization. The goals can reach from maximizing
his margins/sales/customer satisfaction to minimizing costs/stocks/delivery time. The pareto
price optimization can be performed either manually or automated by following a specific ap-
proach.
A competition-based automated approach is offered by repricing providers.4 The main repricing service covers crawling offers from CSAs and calculating price recommendations according to the chosen strategy of the online retailer. In this way the customer's criteria are indirectly represented. Usually the online retailer wants to be at the top of the CSA's list for the purpose of attracting many customers.

4 During the further course of this thesis, automated competition-based repricing activities are associated with repricing providers. However, there also exist online retailers who have self-developed solutions for this purpose.
The repricing providers consider prices at a specific point in time (depending on the crawling interval). However, the calculated price recommendations are future-oriented and valid until the next offer crawling process. In a dynamic market environment the pricing strategy of an online retailer may therefore not be fulfilled. This is exactly where this thesis continues: by trying to derive the other online retailers' pricing strategies and by predicting prices for the next time frame.
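A minimal sketch can illustrate how such a competition-based recommendation works. The rule below is a hypothetical target-position strategy (the function and parameter names are invented for illustration, not taken from any actual repricing provider): it aims at a given rank in the CSA's price list, undercuts the competitor currently at that rank by a configurable price gap, and clips the result to the retailer's minimum/maximum price bounds.

```python
def recommend_price(competitor_prices, min_price, max_price,
                    price_gap=0.01, target_position=1):
    """Hypothetical target-position rule: undercut the competitor at
    `target_position` by `price_gap`, clipped to [min_price, max_price]."""
    ranked = sorted(competitor_prices)
    if len(ranked) < target_position:
        # Too few competitors to target the desired rank: fall back to the cap.
        return max_price
    candidate = ranked[target_position - 1] - price_gap
    return round(min(max(candidate, min_price), max_price), 2)

# Undercut the cheapest competitor by one cent, but never go below the floor.
print(recommend_price([19.99, 21.50, 24.90], min_price=18.00, max_price=25.00))
```

In practice such a recommendation is only valid until the next crawling interval, which is precisely the gap this thesis addresses.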
1.3 Objective
This thesis aims to perform an in-depth examination of pricing dynamics and their underlying pricing strategies on CSAs, both theoretically and practically. In order to fulfill this goal, the thesis addresses three central research questions:
1. What are the pricing dynamics on a CSA? Can price series provide enough information in
order to derive advanced pricing insights?
2. To which extent can pricing strategies be extracted?
3. How precise can prices be predicted using the gained pricing knowledge?
A fine-grained evaluation should be conducted by applying state-of-the-art machine learning methods and comparable approaches. Among the machine learning methods, the focus is on decision trees. The developed approaches should be wrapped in proofs of concept. This thesis pioneers by contributing:
• A thorough market analysis regarding price changes on a CSA
• Extracting pricing strategies on a real dataset of a CSA
• Applying a decision tree approach to predict prices on a CSA
• Predicting prices on a CSA for all resellers instead of focusing on minimum prices
The practical implication consists of providing hints for transforming this thesis' applied approaches into applicable features for repricing providers.
The scope is limited to a business perspective, more precisely a repricing provider's perspective. Consumers and CSAs are considered as black boxes. This implies that a consumer perspective and corresponding questions (like how pricing strategies influence the buying decisions of different kinds of consumers) are out of scope for the evaluation. The CSAs are considered as black-box online marketplaces. This means that questions like how a CSA can influence prices on its platform are out of scope, too. The price prediction is limited to the next period. The thesis operates on historical price information. This implies that no active market intervention takes place. It is assumed that information like internal sales or demand data is not available to the repricing providers.
1.4 Structure
The remainder of this thesis is structured as follows: The next chapter gives a conceptual overview ranging from dynamic pricing basics to CSA mechanisms. In chapter three, the market of repricing providers is examined with a focus on Germany and the USA. Based on these findings, repricing strategies and corresponding parameters are derived. A fine-grained market analysis of a German CSA is conducted in chapter four. The main chapter five comprises evaluated proofs of concept for automated repricing classification, price prediction and pricing strategy extraction. The sixth chapter concludes the thesis.
2 Literature Review
This chapter supplies background information ranging from dynamic pricing to price prediction. Basic pricing mechanisms on CSAs are explained. Key messages from related work are
embedded in the context of the thesis’ subjects.
2.1 Dynamic Pricing
Dynamic Pricing (DP) describes price analysis and price adjustments of products or services in
a market environment, where prices can easily and frequently be adjusted (Boer 2015a, p. 2).
Such price adjustments are shaped by their realtime character (Lin and Sibdari 2009, p. 969).
Since online retailers operate in such a changeful environment (Boer 2014, p. 863), it is essential to understand the basic processes and directions of DP, particularly since pricing is a vital aspect of a reseller's activities due to its close link to economic success (Kopalle et al. 2009).
DP has its origin in the travel industry decades ago and was subsequently deployed in the retail industry (Chen and Chen 2014, p. 1). It enables resellers to increase revenue by synchronizing supply with demand, to respond to dynamic demand patterns and to segment customers (Chen and Chen 2014, p. 1). DP models can be distinguished into four baseline directions, as presented in table 1.
DP Direction       Characterized by
Demand-based       Further differentiation into static and dynamic demand curves with different consumer types
Competition-based  Number of competitors in the modeled market
Learning-based     Pricing policies that consider uncertainty regarding the relation between price and expected demand
Inventory-based    Depends on the reseller's capacity (reaching from limited to infinite inventory levels)

Table 1: Baseline DP directions based on Boer (2015) and Gönsch et al. (2013, p. 511).
Demand-based DP is denoted by customer differentiation. Most classical demand-based DP models assume myopic customer behavior, where customers buy products as soon as the price falls below their product valuations (Levin et al. 2009, p. 32). Strategic customer behavior, in contrast, takes future prices into account (Elmaghraby and Keskinocak 2003), which can be seen as a counteraction to DP. Strategic customers take the option of delaying their purchases into account (Levin et al. 2009, p. 41). Strategic behavior of customers can have serious impacts on revenues when DP is not used (Levin et al. 2009, p. 32). Typically, a reseller wants to adjust his prices in line with demand. This alignment has the intention of skimming the customers' reservation prices (Lin and Sibdari 2009, p. 969). A fundamental problem lies in the reseller's state of not knowing the consumers' response to different selling prices. Hence, the revenue-optimizing prices cannot be known in advance (Boer 2015b, p. 1). Most studies in this field are written under the assumption of stable demand functions, which is not realistic.
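The myopic model can be made concrete with a small sketch (all numbers invented): each customer buys as soon as the posted price is at or below his reservation price, so demand at a price is simply the count of valuations at or above it. The reseller's fundamental problem is that the valuations are unobserved, so the revenue-optimal price computed below cannot be found in practice.

```python
def demand(price, valuations):
    """Myopic customers buy immediately once the posted price is at or
    below their reservation price."""
    return sum(v >= price for v in valuations)

def best_price(candidates, valuations):
    """The revenue-optimal price maximizes price * demand(price); a real
    reseller cannot compute this because valuations are unobserved."""
    return max(candidates, key=lambda p: p * demand(p, valuations))

valuations = [5, 8, 8, 12, 15, 20]   # hidden reservation prices (invented)
candidates = [5, 8, 10, 12, 15, 20]  # prices the reseller could post
p = best_price(candidates, valuations)
print(p, p * demand(p, valuations))  # best price and the revenue it yields
```

This also shows why strategic customers complicate matters: if buyers delay purchases anticipating lower prices, the valuations themselves become price-path dependent.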
The reseller has to pay attention when applying customer segmentation with DP. Customers expect price changes as long as there have been price changes in the past (Bergen et al. 2003, p. 668). But there is a fine line between a negative reaction caused by the perception of price discrimination and potential economic benefits, especially since this practice relies on loyal customers with low price sensitivity (Weisstein et al. 2013, p. 505).
The most interesting direction for this thesis is competition-based DP. This direction requires monitoring of the competitors' prices. A common assumption in corresponding models is that each reseller has in-depth knowledge about the market participants. For example, this knowledge may include a reseller's pricing strategy or remaining capacity and the customers' reservation prices or demand curves. This assumption is unrealistic (Sato and Sawaki 2013, p. 223).
Lin and Sibdari (2009) develop a game-theoretic model for DP which is in accordance with
the basic nature of CSAs. It considers competition and price comparison shoppers. Their myopic
buying decisions are based on prices, inventory levels and reservation prices. However, Lin and
Sibdari (2009, p. 971) assume real-time inventory levels as public information.
Levin et al. (2009) develop a stochastic game-theoretic model with DP under competition
and a dedicated strategic customer model. The authors conclude that strategic customers reduce the reseller's profit. Additionally, if myopic customers are not treated as such, the reseller's profit decreases as well. This model requires perfect knowledge of the market information, including remaining capacity and market segments, for both resellers and customers.
Currie et al. (2007) present a DP model for airline tickets. It is characterized by limited
inventory, a fixed time constraint, finite horizon, changing ticket demand and two competitors.
The competitors' prices can be modeled with any price function. This function needs to be
known in advance, but this could be improved by using forecasts as an alternative. The actual
optimization problem is solved by calculus of variations and Lagrangian multipliers.
DP with assortments needs special treatment by modeling cross-interactions (Kachani and
Shmatov 2010).
Learning-based DP tries to derive the relation between price and market response. Boer (2015) uses historic sales data and a corresponding estimator. He forecasts sales via a sliding-window linear regression, giving the most recent sales higher weights. However, the model operates under the assumption of a monopolist.
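A simplified sketch of such an estimator (illustrative only, not Boer's actual model; all names and numbers are invented): fit a linear relation between price and sales on the last few observations, with exponentially decaying weights so that recent sales count more.

```python
def weighted_linreg(xs, ys, weights):
    """Closed-form weighted least squares fit of y = a + b * x."""
    total = sum(weights)
    xm = sum(w * x for w, x in zip(weights, xs)) / total
    ym = sum(w * y for w, y in zip(weights, ys)) / total
    b = (sum(w * (x - xm) * (y - ym) for w, x, y in zip(weights, xs, ys))
         / sum(w * (x - xm) ** 2 for w, x in zip(weights, xs)))
    return ym - b * xm, b  # intercept, slope

def forecast_sales(prices, sales, window=5, decay=0.8):
    """Fit on the last `window` (price, sales) pairs; the most recent
    observation gets weight 1, the one before it `decay`, and so on.
    Returns a function predicting sales at a candidate price."""
    px, sy = prices[-window:], sales[-window:]
    w = [decay ** (len(px) - 1 - i) for i in range(len(px))]
    a, b = weighted_linreg(px, sy, w)
    return lambda price: a + b * price

predict = forecast_sales([10, 11, 12, 13, 14], [50, 46, 41, 37, 33])
print(round(predict(15), 1))  # expected sales if the price is raised to 15
```

The decaying weights implement the "most recent sales get higher weights" idea; the sliding window keeps the estimate responsive to a changing price-demand relation.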
Primarily, resellers have to perform price experiments in order to learn about the price
which generates the highest profit (Boer 2014, p. 863).
DP can be further used for improving inventory and capacity management (Transchel and
Minner 2009). The interesting aspect of inventory-based DP is that this approach indirectly considers demand, which in turn influences the inventory.
From a reseller's point of view, it is paramount that one's own pricing strategy is interwoven with the competitors' and has bidirectional effects (Kopalle et al. 2009).
2.2 Comparison Shopping Agents
A Comparison Shopping Agent (CSA), also known as price comparison website, shopbot or price
comparison engine, acts as an intermediary between customers and resellers. The CSA periodically
aggregates objective data (e.g. prices) and quantified subjective data (e.g. service quality) from
multiple online retailers.
CSAs are a popular resource for strategic customers. As early as 2001, 45.7% of hardware online shoppers used CSAs (Zhang and Jing 2011, p. 3). According to a study by Aprimo (2012), 96% of smartphone users intend to perform price comparisons in the future. Over 50% of smartphone users use their device in local stores for price comparisons, with consumer electronics (39%) being the top category for mobile price comparison.
A typical offer section from a CSA is shown in figure 2. It provides product description,
pricing information, availability, reseller reputation and an affiliate link. This information is
supplied to the customer as sorted, quickly accessible price overviews at product level.
Figure 2: A typical offer section from a CSA (idealo.de).
In general, CSAs can be categorized by the type of relationship to their resellers (Wan et al. 2003, pp. 500-501):
• Independent CSA: There exists no partnership, and ads are displayed on the price comparison website.
• Dependent CSA: There exists a contractual partnership in which the reseller pays for the offered services.
• Embedded CSA: The comparison mechanism is integrated into the platform itself, as implemented by Amazon's marketplace.
According to Moraga-González and Wildenbeest (2011, p. 6), the business models of CSAs can be distinguished based on their revenue model:
1. The customers don't have to pay, and the resellers are charged either a flat fee or, more recently, per click (cost-per-click). The fees can be category-dependent, as on Pricegrabber, or calculated per transaction, as on pricefight.com.
2. Free for both parties, e.g. Google Shopping before February 2013.
3. The customers are charged, which is less common.
For example, geizhals.at is an Austrian dependent CSA which has implemented the first business model. Sellers have to pay fixed fees for clickthroughs. The fee is reduced if geizhals.at is embedded on the reseller's web site. In Austria, electronic online resellers cannot afford not to be represented on geizhals.at (Hackl et al. 2014, p. 202).
There is considerable uncertainty about the quality of a CSA (Clement and Schreiber 2013, pp. 265-269; Mei-Pochtler and Hepp 2013, p. 78). Since a CSA may not provide accurate and complete information, customers have to use multiple CSAs or conduct their own online price research (Pathak 2012, p. 64; Zhang and Jing 2011). A high perceived quality of a CSA is important because it increases the customer's purchase intention (Bretschneider et al. 2015, pp. 46-51) and therefore the CSA's revenue. There also exist less common meta/derivative CSAs which have no direct connection to resellers but rather crawl other CSAs (Wan et al. 2003, pp. 502-503). Examples include roboshopper.net and meta-preisvergleich.de.
Typically, a CSA only provides product price comparisons. A more sophisticated kind of problem is the price comparison of whole consumer baskets. This class of optimization problems is known as the 'Internet Shopping Optimization Problem'. Błazewicz et al. (2010, pp. 386-387) prove that this kind of problem is NP-hard5. The CSA geizhals.de provides such a consumer basket optimization with a brute-force approach bounded by a time constraint.
Pathak (2012, pp. 69-70) discovers significant temporal delays of up to 3.39 days between prices in online shops and on six major CSAs. Reasons for incomplete information can be found in temporal delay and selection bias (Pathak 2012, p. 65).
2.2.1 Price Dispersion
During the early stages of electronic commerce, a transformation into archetypal economic models was predicted (Brynjolfsson and Smith 2000). The media jumped on the economic bandwagon and made auspicious promises (Economist 1999):
THE explosive growth of the Internet promises a new age of perfectly competitive mar-
kets. With perfect information about prices and products at their fingertips, consumers
can quickly and easily find the best deals. In this brave new world, retailers’ profit
margins will be competed away, as they are all forced to price at cost.
5 NP-hard is a complexity class in computer science comprising problems that are at least as hard as those solvable in non-deterministic polynomial time.
At first glance, the prediction seems to be justified by findings like the following: A CSA is characterized by almost zero sunk costs6, minimal resource requirements and market transparency (Haynes and Thompson 2008a, p. 4; Haynes and Thompson 2008b, p. 471). Homogeneous products are offered, which in general are well suited for price comparison (Clement and Schreiber 2013, pp. 267-268). A CSA enables a strong reduction of the customer's search costs (Bakos 1997; Ellison and Ellison 2009, p. 428). Further, a CSA curbs resellers' obfuscation techniques (Ellison and Ellison 2009) such as false prices or excessive delivery costs. This reduction has been confirmed with test purchases (Baye; Morgan, and Scholten 2004, p. 18). CSAs establish a higher level of price transparency and reduce information asymmetries (Clement and Schreiber 2013, p. 285).
Why do all these characteristics not lead to price convergence in a multi-reseller environment? In theory, as long as all firms are Bertrand oligopolists and the customers are fully informed, all transactions take place at the perfectly competitive price (Bakos 1997, pp. 3,10; Baye; Morgan, and Scholten 2004, pp. 4-5,18). This statement is based on the Bertrand model, which implies that competition should cut prices until the marginal production costs are reached. The main assumptions are (Tirole 1988, pp. 209-212):
• The offered products are homogeneous.
• At least two resellers operate in the market, the resellers do not cooperate, and they face the same product costs.
• The customer is always a strategic buyer who has no search costs and only purchases at the lowest price.
Varian (1980) pioneered the field with a model that explains price dispersion via search theory. He differentiates between uninformed customers, who buy at random local shops, and informed customers, who know the price distribution of the local shops, e.g. from newspapers. The more resellers, the higher the price dispersion, which is explained by the different search costs of the two customer types. Counterintuitively, he predicts a positive correlation between the number of resellers and the average selling price.
Baye and Morgan (2001) transfer Varian's model to electronic markets, where CSAs act as intermediaries connecting customers and resellers. The customers are divided into those who use CSAs and those who do not. If all resellers were listed on the CSA, this would lead to Bertrand competition and hence to price convergence. However, the CSA desires price dispersion in order to sustain its business model. Hence, to maximize its profit, the CSA must set reseller fees high enough to prevent all resellers from joining. A shortcoming of this model is its assumption of buyer fees, which has not materialized in practice.
6 Sunk costs are already incurred irreversible costs.
Grover et al. (2006, pp. 300-302) conducted a meta-analysis and identified three main explanatory approaches for price dispersion in electronic markets. These findings are presented in table 2.
Explanatory Approach      Stated by    Exemplary Reasons
Search costs              10 papers    Reseller loyalty, reputation, product popularity
Service differentiation   6 papers     Fulfillment, ordering process, consumer satisfaction
Market characteristics    6 papers     Number of resellers, stage in product life cycle, average price
Table 2: Price dispersion explanation approaches based on Grover et al. (2006, pp. 300-302).
A key message for explaining price dispersion is the following correlation: the more resellers offer a product, the higher the price dispersion (Haynes and Thompson 2008b, p. 467; Baye; Morgan, and Scholten 2004), because new resellers introduce their offers with low prices (Bounie et al. 2012, p. 10).
Hackl et al. (2014) performed an analysis of resellers' margins with data gathered from geizhals.at combined with wholesale prices from a hardware producer. They rely on daily pricing data for 70 digital cameras from January 2007 until December 2008. Hackl et al. (2014) observe that the more resellers, the lower their margins. The number of substitutes is also essential, since it is negatively correlated with the margin. The further a product has progressed in its life cycle, the lower its price and, in turn, the lower the margin (Hackl et al. 2014, p. 215).
Many researchers have contributed valuable insights on price dispersion driven by the heterogeneity of the electronic market. On the customer side, the main differentiation is between strategic and myopic buyers (consulting a CSA or not) (Varian 1980, p. 652; Grover et al. 2006). Besides prices, customers consider shipping services, availability and reputation, among others (Klausegger 2009, p. 16). This variety of identified customer attributes of interest provides supplementary explanations of price dispersion (Zhang and Jing 2011, p. 2). Resellers therefore implement differentiation schemes in order to access different market segments with the corresponding customer groups (Clay et al. 2001, p. 521).
Furthermore, researchers discovered: the more information overload7, the higher the price dispersion in electronic markets; likewise, the more information equivocality8, the higher the price dispersion (Grover et al. 2006).
In summary, the electronic commerce reality has shown that price dispersion is pervasive.
Consequently, the prerequisite for different pricing strategies is ensured.
7 In terms of incomplete information: may lead to ineffective decisions.
8 Online feedback systems like consumer ratings are needed for online buying decisions.
2.2.2 Customer Characteristics
Prices on CSAs have great impact on the customer's price perception. As a result, prices on CSAs serve as internal reference prices and acceptable price ranges (Jung et al. 2014, p. 2084; Broeckelmann and Groeppel-Klein 2008). Notably, consumers are more sensitive to shipping costs than to item prices (Brynjolfsson and Smith 2001, p. 5).
The top three offer selection criteria in a CSA are price, availability and reseller rating. Surprisingly, the main intention of using a CSA is not searching for the lowest offers (42%) but researching the best-fitting products and actually available manufacturers (51.3%). Thus, a CSA has great influence on the manufacturer selection (69.4%). The study was conducted by the Austrian CSA geizhals.at by asking 2,000 of their users in 2009 (Klausegger 2009).
The position on CSAs is crucial for strategic customers. Findings on the clickthrough behavior on search result pages can be transferred, since the results are ranked in the same manner. Petrescu et al. (2014) analyzed 465,000 keywords on 5,000 websites of Google search results. 67.6% of the clickthroughs are generated by the top five hits.
According to Brynjolfsson and Smith (2001, p. 15), 49% of the CSA users chose the lowest offer for books on the former CSA evenbetter.com. Baye et al. (2009) reveal that moving from first to second place results in a loss of 60% of clicks. Their dataset consists of PDAs offered on kelkoo.com. Further, they discover a loss of 17% in clickthrough rates for each competitor positioned above. That is a vital point for the further course of this thesis: since top positions in a CSA generate more clicks and hence sales, it is reasonable for resellers to try to reach top positions with advanced pricing strategies.
2.2.3 Reseller Characteristics
The main motivation for a reseller to be listed on CSAs is gaining more visibility in order to increase sales. Schieder and Lorenz (2012, pp. 18,20) have carried out a study about the general usage of pricing intelligence with 44 online resellers. 30% of the resellers use methods of 'dynamic price optimization', and 61.5% of those report a clear profit increase.
A study from the Austrian CSA geizhals.at confirms the findings above: 60.7% of their listed resellers observe increases in sales and profits after being listed. The main reason for being listed on a CSA is acquiring new customers. The results are based on an online survey with 89 resellers (Klausegger 2011, pp. 7-8).
The more customers use CSAs, the higher the resellers' pressure to be listed (Clement and Schreiber 2013, p. 267). Unfortunately for resellers, customers show great loyalty to a CSA but not to the resellers (Zhang and Jing 2011, p. 8). However, a good reseller reputation is important, especially for resellers charging a price premium (Bodur et al. 2015, p. 137). Hence, the reseller rating (which quantifies the reseller reputation) is positively correlated with the reseller choice made by customers on CSAs (Bodur et al. 2015, p. 135). Brynjolfsson and Smith (2001, p. 45)
discovered that retailers with highly rated reputations and previously visited retailers have significant price advantages of 3.1% and 6.8%, respectively, in the customers' view. Waldfogel and Chen (2006, pp. 447-448) dispute the importance of reseller reputation. They state that the more CSAs are used, the less important the reputation of the reseller becomes. The reason for this assertion can be found in the increasing price sensitivity and the accompanying decrease in loyal customers (Kocas 2002, pp. 117-118).
The top three pricing business challenges are: Increased price sensitivity of consumers (55%),
increased pricing aggressiveness of competitors (48%) and increased price transparency (47%).
This statement originates from a study based on 123 worldwide online resellers (Baird and
Rosenblum 2015, p. 6).
Riekhof and Wurr (2013, p. 10) asked 231 German resellers about the main obstacles to pricing. The top two answers are cost calculations (88%) and competitor analysis (70%).
Based on a dataset from Amazon US/UK/FR, Bounie et al. (2012, p. 1) analyze the Amazon marketplace. They observe a reseller price adjustment only every 20th day. That is remarkably low, since the market analysis in chapter 3 of this thesis reveals that resellers adjust their prices on average every third day in a current dataset. The last point illustrates a problem of the related work with focus on CSAs: many of these studies lack current datasets, and old datasets are reused repeatedly, e.g. Bounie et al. (2012) use a dataset from 2006, Zhang and Jing (2011) use a dataset from 2001 and Ellison and Ellison (2009) use a dataset from 2000-2001. Since then, CSAs have evolved, more and more repricing providers with high-frequency repricing have entered the market (see chapter 3.1), and price comparison can be performed even more easily using dedicated mobile apps. So, some papers are already outdated before they are published. This thesis tries to overcome that issue with an in-depth market analysis of a recent dataset. The dataset is backed by a high-frequency crawling interval of 15 minutes, which results in an unprecedented amount of crawled offers for CSAs in the literature (21.6 million offers).
2.3 Dynamic Pricing Strategies
This chapter summarizes theoretical automated repricing approaches developed in the literature. Different reseller pricing strategies can sustain themselves side by side because they map onto the heterogeneous buying strategies of the customers (Grover et al. 2006). Since manual repricing is slow and expensive and thus inflexible, a growing need for automated pricing strategies arose. Multiple pricing strategies have been developed and tested in simulated CSA environments:
Undercutting Strategy | competition-based | Deck and Wilson (2003)
This strategy's main action consists in undercutting the lowest price by a fixed amount. Supplementary, minimal and maximal price boundaries are set. As soon as the lowest price can't be undercut within these boundaries, the maximal price is set. The resulting prices are the same as in a game-theoretic prediction, so this is probably what manual price setting converges to.
Low Price Matching Strategy | competition-based | Deck and Wilson (2003)
This strategy tries to match the lowest price. Price boundaries exist here too. As long as the lowest offer can't be reached, the next reachable price is matched. Compared to a game-theoretic prediction, the resulting prices are higher.
Trigger Pricing Strategy | competition-based | Deck and Wilson (2003)
This strategy starts by setting an initial price. If another reseller is at or below a threshold (trigger), an associated new price is set. The resulting prices are lower compared to a game-theoretic prediction.
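The three competition-based rules above reduce to a few lines each. The following sketch uses illustrative function and parameter names; the exact boundary handling in the cited papers may differ:

```python
def undercut(competitor_prices, step, p_min, p_max):
    """Undercutting: beat the lowest competitor price by a fixed step;
    fall back to the maximal price when the lower boundary is violated."""
    target = min(competitor_prices) - step
    return target if target >= p_min else p_max

def match_lowest(competitor_prices, p_min, p_max):
    """Low price matching: match the lowest competitor price that is
    still reachable within the boundaries."""
    reachable = [p for p in competitor_prices if p >= p_min]
    return min(reachable) if reachable else p_max

def trigger_pricing(competitor_prices, threshold, trigger_price, initial_price):
    """Trigger pricing: switch to a predefined price once any competitor
    is at or below the threshold, otherwise keep the initial price."""
    return trigger_price if min(competitor_prices) <= threshold else initial_price
```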
Beat Half the Market Strategy | competition-based | Hertweck et al. (2009)
This strategy aims for a middle position in the CSA rankings.
Tiered Pricing Strategy | learning-based | Dasgupta and Melliar-Smith (2003)
Dasgupta and Melliar-Smith (2003) introduce a strategy which attempts dynamic pricing by deriving the customer's purchase intention. The intention is classified as price-sensitive (comparison shopping) or price-insensitive based on the reseller selection criterion and the historic purchase behavior. The strategy tries to learn the buyers' reservation prices. The price-insensitive buyers are charged higher prices. The prices for price-sensitive buyers are calculated by fitting a polynomial to historical price and profit data. This fit is used to predict future prices and profits via non-linear regression. In theory, this strategy can increase the reseller's profit by up to 20%. The feasibility of deriving the customer's purchase intention has been confirmed by other researchers, e.g. Moe (2003). She shows that resellers can differentiate the shopping behavior of their customers by analyzing clickstream data. She classifies the derivable shop visitation behavior into directed buying, search/deliberation, hedonic browsing and knowledge building.
Reinforcement Learning Strategy | learning-based | Kephart et al. (2000)
The reinforcement learning9 strategy is based on Q-Learning and learns anticipated future dis-
counted profits. Subsequently, the repricing policy with the highest future discounted profit is
chosen.
Profit Price Adaption Strategy | learning-based | Kutschinski et al. (2003)
This strategy estimates profits based on current price and price/profit history with a single state
Q-Learner.
Q-Learner Strategy | learning-based | Kutschinski et al. (2003)
This strategy uses Q-Learning in combination with a Boltzmann price selection mechanism. At
the beginning, this mechanism allows a wide range of possible profit functions and keeps getting
9 Reinforcement learning is learning from feedback (Kutschinski et al. 2003, p. 2209). Q-Learning (Watkins andDayan 1992) is a reinforcement learning technique which is based on dynamic programming. It learns to actoptimally in Markovian environments by experiencing the consequences of actions.
2 Literature Review 18
more restrictive after each iteration. It learns a profit function and tries to undercut competitors.
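A minimal single-state Q-learner with Boltzmann price selection could look as follows; the class name, parameter values and the temperature floor are illustrative assumptions, not taken from Kutschinski et al. (2003):

```python
import math
import random

class BoltzmannQPricer:
    """Single-state Q-learner over a discrete price grid with
    Boltzmann (softmax) exploration."""

    def __init__(self, prices, alpha=0.2, temperature=2.0, cooling=0.95):
        self.prices = prices
        self.q = {p: 0.0 for p in prices}  # estimated profit per price
        self.alpha = alpha                 # learning rate
        self.temperature = temperature     # high temperature = wide exploration
        self.cooling = cooling             # temperature shrinks each iteration

    def choose_price(self):
        # Boltzmann selection: higher estimated profit -> higher probability.
        weights = [math.exp(self.q[p] / self.temperature) for p in self.prices]
        return random.choices(self.prices, weights=weights)[0]

    def update(self, price, observed_profit):
        # Move the profit estimate towards the observed profit ...
        self.q[price] += self.alpha * (observed_profit - self.q[price])
        # ... and make the selection mechanism more restrictive (greedier),
        # with a floor to keep the softmax numerically stable.
        self.temperature = max(self.temperature * self.cooling, 0.05)
```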
Derivative Following Strategy | key performance indicator | Kephart et al. (2000)
This strategy is detached from competition and customers. It consists of incremental price changes in one direction. The strategy's pricing behavior can be aligned with key performance indicators like profitability or revenue. As soon as the indicator decreases, the direction of the price changes is reversed. The adjustments can be enhanced by adaptive stepwise price adjustments (Dasgupta and Das 2000, p. 4).
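One derivative-following iteration reduces to a simple sketch; the function signature and the fixed step size are illustrative assumptions:

```python
def derivative_following_step(price, step, last_profit, current_profit):
    """One iteration of derivative following: keep moving the price in the
    current direction while the performance indicator improves, and reverse
    the direction once the indicator drops."""
    if current_profit < last_profit:
        step = -step  # reverse direction on deteriorating profit
    return price + step, step
```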
Goal-directed Strategy | inventory-based | DiMicco et al. (2001)
This strategy is an inventory-based variation of the Derivative Following Strategy. The main input parameter is a time span by whose end a product should be sold. The strategy adjusts the product prices according to the inventory level and adapts based on inventory changes and time progress. There is no direct consideration of competitors and buyers.
Ramezani et al. (2011) present an advanced Goal-directed Strategy. It focuses on the number of products sold and the corresponding changes in inventory. An evolutionary algorithm is used for optimizing pricing step amplitudes and price change thresholds.
Game-theoretic Strategy | game-theoretic | Kephart et al. (2000)
The strategy calculates a random distribution of prices considering the ratio of strategic buyers, the buyers' reservation prices and the number of resellers. This strategy's drawback is that all this knowledge is required in advance.
In general, the presented competition-based approaches are over-simplistic; e.g., they can be reduced to modifications of the later introduced Target Position Strategy. Chapter 3 shows that, in practice, the pricing strategies in use are much more elaborate, with a phalanx of adjustable parameters. Furthermore, the complexity of repricing strategies increases with the growing number of configuration parameters (e.g. internal clickstream data like basket activities, sales, number of product views) (Meyer 2012, pp. 69-70).
The learning-based strategies, which all rely on Q-Learning, suffer from slow learning rates and unrealistic assumptions about the economic environment. The reason is the Markov property, which has to be fulfilled: it states that the environment is not allowed to change during learning (Kutschinski et al. 2003, p. 2209).
All theoretical strategies have only been probed in simulated markets, not under real conditions, so their actual impact cannot be assessed. In practice, Haynes and Thompson discover a skimming price strategy. They observe reseller fluctuations of up to 35% per week on nextag.com. This observation can be traced back to so-called 'hit-and-run' pricing strategies, where resellers enter the market with low prices for a short period of time until they exit (Haynes and Thompson 2008a, p. 19; Haynes and Thompson 2008b, p. 467).
In the airline ticket context, Sato and Sawaki (2013) state that knowledge of the competitor's pricing strategy has great impact on maximizing the expected revenue. Generally speaking, there exists no perfect pricing strategy. Depending on the degree of CSA usage and the competitors' strategies, different pricing strategies are more effective (Hertweck et al. 2009, pp. 166-168).
2.4 Pricing Strategy Extraction
Pricing intelligence is exceedingly useful for resellers. Presently, the top three applications are weekly price reviews, adaptive price adjustments and monthly/quarterly key reviews. This statement originates from a study based on 123 worldwide online resellers (Baird and Rosenblum 2015, p. 13).
However, extracted underlying pricing strategies go a step further and can be seen as blueprints for pricing intelligence. To the best of my knowledge, there exists only a single paper which addresses pricing strategy extraction on CSAs. The approach of Hertweck et al. (2010) consists of two main stages:
1. Classifying strategies of competitors
2. Providing best counterstrategies in a simulated market
They model a market with one product, 1,000 strategic and myopic customers, four competitors and a 30-day horizon. The competitors each use one of five common strategies: manual, lowest price match, trigger, derivative following, and beat half the market (see the previous subchapter 2.3).
During the first stage, random competitor strategies are created. Based on the historic prices, eleven basic statistical features are derived, e.g. the number of price changes, the price standard deviation and the average position. A modular neural network is trained for each strategy. The authors achieve a strategy accuracy of 65.3% up to 92.7%. At least three of four competitor strategies are correctly identified in 85.9% of cases.
The second stage consists of the calculation of a table containing the best counterstrategies for all strategy combinations.
Hertweck et al. (2010) conclude that a profitability increase of 2.4% is possible in their simulated environment.
Their concept shows several shortcomings: If their approach considered all competitors, disproportionately more computational effort would be needed to calculate all combinations. Furthermore, the more strategies have to be extracted, the lower the probability of identifying all of them correctly. However, concentrating on the top four competitors is a promising approach to reduce complexity and achieves good results. Comparing the used features, this thesis' features provide a wider spectrum of sophisticated measures (see table 12 with 40 historic features and table 14 with 15 current features). Hertweck et al. (2010) train and evaluate their model in a synthetic environment, whereas in this thesis a real dataset is used. Finally, resellers apply far more advanced strategies (see chapter 3), which the five applied basic strategies by no means cover.
2.5 Price Prediction
The closest and most sophisticated approach was provided by Decide with their service on decide.com.10 They offered a paid service for predicting the best time to buy a product online. Decide analyzed the offer price history on CSAs and informed their customers as soon as the lowest price for the next two weeks had been predicted. Supplementary, they granted a price guarantee compensating for suboptimal purchase recommendations: if a cheaper product price was offered within two weeks, the difference to the cheapest offer was refunded. Decide was acquired by eBay in September 201311 and the service was shut down. Decide's approach is based on the co-founder's previous paper:
Etzioni et al. (2003) predict prices for airline tickets and make recommendations for purchase decisions. They combine multiple techniques. Ripper (Cohen 1995) is used as a separate-and-conquer rule learner based on flight number, remaining hours until departure, current price and airline. Subsequently, Q-Learning is applied for making purchase decisions for the next interval. It is based on reinforcement learning and optimizes decisions based on discounted future rewards (negative and positive). A moving average model, a time series prediction technique, is used to make a secondary purchase decision. All techniques are combined via stacked generalization into aggregated rules. Etzioni et al. (2003) achieve 4.4% savings on ticket prices based on a real dataset containing 12,000 ticket prices within a 41-day period. Their savings correspond to 61.8% of the overall possible savings.
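The moving-average component can be illustrated as a simple buy/wait rule; the window size and the exact decision rule are assumptions for illustration, not the precise model of Etzioni et al. (2003):

```python
def moving_average_decision(prices, window=5):
    """Secondary purchase rule based on a moving average: buy when the
    current price falls below the recent moving average, otherwise wait."""
    recent = prices[-window:]
    avg = sum(recent) / len(recent)
    return "buy" if prices[-1] < avg else "wait"
```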
In principle, related papers concerning ticket prices in the airline industry are well suited for the context of this thesis. The airline industry varies prices by seasonality, availability and competition (Etzioni et al. 2003, p. 119), similar to electronic commerce. Both domains deal with uncertainty about future prices (Agrawal et al. 2011b, p. 709). Intermediaries also operate there in the form of flight CSAs, and ticket prices likewise exhibit a stepwise character. The differences are discussed later in this thesis.
A further analogy from the airline industry also addresses purchase recommendations for flight tickets. Domínguez-Menchero et al. (2014) exploit the general nature of flight prices, which manifests in the negative correlation with the remaining days until departure. Instead of flight prices, their model uses the reciprocal saving rates of the ticket prices for a horizon of 30 days. Every route is estimated by isotonic regression, which tries to find a best fit in a point cloud with non-increasing piece-wise functions.
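Such a non-increasing fit can be computed with the pool-adjacent-violators algorithm. The following is a generic sketch of antitonic least-squares regression, not the exact estimator of Domínguez-Menchero et al. (2014):

```python
def antitonic_fit(y):
    """Best non-increasing piece-wise constant least-squares fit to the
    sequence y, via pool-adjacent-violators applied to the negated series."""
    # Blocks of [weighted mean, weight] for the negated (non-decreasing) problem.
    blocks = []
    for v in (-x for x in y):
        blocks.append([v, 1])
        # Merge adjacent blocks while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    # Expand the blocks back into a fitted value per observation.
    fit = []
    for mean, weight in blocks:
        fit.extend([-mean] * weight)
    return fit
```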
Groves and Gini (2015) provide a comprehensive approach which can be seen as the state of the art in airline ticket price prediction. Their algorithm can be applied to specified routes and travel dates. Daily decisions are made with a time horizon of up to 60 days. The underlying concept is composed of a dedicated feature selection algorithm which uses 92 historic and current features. The current features refer to a particular point in time. Subsequently, a
10 Accessed via the internet archive: https://web.archive.org/web/20130614192602/https://www.decide.com (visited on 09/30/2016).
11 https://www.crunchbase.com/organization/decide-com (visited on 09/30/2016).
regression model based on partial least squares is applied. Ripper is used for creating decision policies. Finally, a parameter search for finding the best configuration of the previous steps is applied. In an evaluation, their approach clearly outperforms the baseline approach of Etzioni et al. (2003).
This thesis adopts the separation of feature creation as well (see chapter 5.2.2).
The main difference between the domains of airline tickets and consumer products lies in their nature of usage. Waiting time for consumer products is associated with a loss in utility, whereas airline tickets don't lose utility when the ticket purchase is delayed. So, for consumer products, there exists a tradeoff between potential price drops and loss in utility (Agrawal et al. 2011b, pp. 709-710; Agrawal et al. 2011a, p. 352). In contrast, this thesis focuses on pure price prediction.
Groves and Gini (2015, 3:5-3:6) detect strong and cyclic patterns in airline ticket prices. Such patterns may emerge more clearly than in electronic markets. Further, Groves and Gini (2015, 3:7-3:8) differentiate three base types of competing airlines:
• The low category airlines, which compete with the cheapest offers.
• The medium category airlines, which perform aggressive pricing above the low category airlines.
• The high category airlines, which hold the price premium and rarely adjust their prices.
Compared to the complex strategies found in CSAs in chapter 3, these strategies are easier to predict. Additionally, ticket prices usually rise the closer the departure date comes (Domínguez-Menchero et al. 2014, p. 140), whereas product prices decline over their market cycle (Agrawal et al. 2011a, pp. 714-715). Hence, the approaches from the airline industry are a good baseline, but they need adjustments to match the CSA case.
Agrawal et al. (2011) transfer the knowledge of airline ticket purchase recommendations to the electronic commerce context. They implement a system that helps customers decide when to make a purchase. The system uses the price history and derived features in order to forecast future price distributions with autoregressive models and smoothing methods like Holt-Winters (see chapter 5.2.1). The resulting price distributions are used for building recommendation policies. Additionally, the authors take sales volume, seasonality and competitive products into account. However, sales data and data on all surrogate products are not available in most cases.
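Holt-Winters denotes exponential smoothing with level, trend and seasonal components. A minimal sketch of the trend-only variant (Holt's linear method) illustrates the principle; the smoothing parameters are illustrative defaults:

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=1):
    """Holt's linear (double exponential) smoothing: maintains a level and a
    trend estimate and extrapolates them `horizon` steps ahead."""
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        last_level = level
        # Smooth the level towards the new observation ...
        level = alpha * y + (1 - alpha) * (level + trend)
        # ... and smooth the trend towards the latest level change.
        trend = beta * (level - last_level) + (1 - beta) * trend
    return level + horizon * trend
```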
Lucchese et al. (2012) use a hedonic model for price prediction via autoregressive models in heterogeneous markets. It is based on the hedonic base assumption, which states that product quality can be disassembled into product features. Further, the product quality can be associated with a corresponding price. They consider multiple products and their co-dependencies.
The authors may achieve good results in simulated markets, but in practice all surrogates would have to be known and tracked. Besides, product prices are affected by more than the underlying product features, e.g. by competition, reseller-specific costs or seasonality.
Agrawal et al. (2011) develop an interesting modular concept for encapsulating purchase recommendations. They implement three different recommendation strategies for customers making product purchases on a CSA. Their strategies incorporate forecast algorithms as black boxes. Further, a loss in utility is considered, connected to the prolonged waiting time before the desired products can be used. The three recommendation strategies process the calculated forecast distributions by evaluating, for example, the future maximum or average expected utility.
Ahmad et al. (2016) investigate the offline version of price prediction under competition. They provide three different approaches for determining local competitors. Subsequently, four vector-based autoregression models are used to predict the retail prices of nearby resellers via historic prices. Besides, they can also predict wholesale prices. Their approach is only transferable to a limited extent, since offline characteristics determine the prices. Such characteristics can be found in the geographical locations, which influence the customers' search costs, or in the predefined competitors.
Research about stock market prediction is not considered in this thesis. Stock prices undergo high-frequency adjustments, whereas prices on CSAs are more stable and can be described as piecewise functions. The stock market often depends on events which can abruptly influence the stock prices. CSAs do not show event-triggered impacts to a large extent; the prices may be influenced by events like manufacturer price drops, seasonality or campaigns. However, price pattern recognition techniques from the stock market can be employed, as shown in chapter 5.3.2.
In summary, the example of decide.com has shown that the prediction of the lowest prices is possible with high accuracy in the form of high-level purchase recommendations. The field of airline ticket prediction provides promising techniques like dedicated feature creation processes.
3 Market Review of Repricing Providers
A repricing provider is an agent that adjusts or recommends prices automatically on the seller's behalf in response to changing market conditions (Kephart et al. 2000, p. 732). Often, the term 'price optimization' is used by repricing providers. This term is inaccurate, since an optimum can't be achieved due to conflicting goals; only Pareto-optimal prices can be reached (Meyer 2012, p. 68).

This chapter provides information about the repricing providers operating in Germany and in the USA. The repricing providers' websites served as the data basis. Unfortunately, they supply only sparse information about their available repricing strategies and underlying strategy parameters. However, by considering a wide spectrum of repricing providers, a realistic picture of repricing characteristics emerges. The market review was conducted in December 2015.
Strategy | Description
Target Position Strategy | This strategy aims for a specified position in a CSA. It is accompanied by a price gap parameter to the next competitor and a decision whether delivery costs are considered.
Pull-Up Strategy | This strategy is a specialized version of the Target Position Strategy. First, it matches the desired target position. Subsequently, the strategy raises the price by a predefined amount in the next iteration. If the competitor pulls up too, an upward pricing spiral has been triggered.
Time Frame Strategy | This is a meta strategy which triggers other strategies or pricing policies based on the current time. Common distinctions are day/night or workday/weekend.
Sole Vendor Strategy | This is a repricing rule which applies as long as no other competitors offer the dedicated product on the CSA.
Interlink Strategy | This strategy is characterized by an alignment on a specified competitor. A price gap may be chosen. The same result can be achieved by applying a whitelist with one competitor to the Target Position Strategy.
Buy-Box Strategy | The main goal of this strategy is to step into Amazon's buy-box. This is not necessarily achieved by the lowest price, because other criteria like shop reputation and availability have to be considered too.
KPI Maximization Strategy | This strategy has no direct reference to the competition, since it adaptively orientates on economic key performance indicators (KPIs) like sales or profit. Occasionally, customer behavior or seasonal components are incorporated.
Table 3: Summary of observed strategies.
A summary of the observed strategy landscape is outlined in table 3. The most popular strategy is the Target Position Strategy (competition-based). Table 4 shows the most important underlying strategy parameters.
Parameter | Description
Price Boundary | The price boundary specifies a valid price range for repricing activities.
Gap | This parameter defines a price gap relative to the aimed-at competitor. The gap can be relative or absolute. A price gap of zero means matching the aimed-at competitor.
Consideration of Delivery Costs | Are delivery costs included in the position calculation on the CSA?
Shop Reputation | Usually, customers on CSAs have the opportunity to rate the resellers, which results in this quantified parameter.
Availability | This parameter expresses the availability of the product.
Blacklist | This kind of list contains competitors which are excluded from the repricing activities.
Whitelist | This kind of list contains only those competitors which are considered for repricing.
AdjustToNextPricier | If the AdjustToNextPricier option is active and the desired target position can't be reached due to the price boundary, the target position is realigned to the next reachable competitor.
Table 4: The underlying strategy parameters.
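A minimal sketch shows how several of these parameters might interact in a Target Position Strategy aiming at position one. The function name, the default gap and the example prices are hypothetical; real providers combine far more parameters:

```python
def target_position_price(competitor_prices, min_price, max_price,
                          gap=0.01, blacklist=frozenset(),
                          adjust_to_next_pricier=True):
    """Undercut the cheapest non-blacklisted competitor by `gap`,
    clipped to the price boundary [min_price, max_price]. If the
    target can't be reached and AdjustToNextPricier is active,
    realign on the next reachable (pricier) competitor."""
    prices = sorted(p for reseller, p in competitor_prices.items()
                    if reseller not in blacklist)
    if not prices:                       # Sole Vendor case
        return max_price
    for competitor_price in prices:      # cheapest competitor first
        candidate = competitor_price - gap
        if candidate >= min_price:
            return min(candidate, max_price)
        if not adjust_to_next_pricier:
            break
    return min_price                     # boundary prevents any undercut
```

For example, with competitors at 9.00, 10.00 and 12.00 and a minimum price of 11.00, the strategy cannot undercut the two cheaper shops; with AdjustToNextPricier it realigns on the 12.00 competitor and returns 11.99.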
3.1 Repricing Providers in Germany
The repricing providers in Germany rely on competition-based strategies, as shown in table 5. wisergermany.de is the only provider which links the competition-based strategy with a demand-based strategy oriented on price elasticity. clousale.com advertises a high-frequency repricing interval of up to two minutes; they rely on the dedicated CSA APIs. beny-software.de and patagona.de offer an adaptive crawling interval whose crawling frequency is determined by the degree of price deltas. Four repricers handle CSAs very generically by offering the integration of further CSAs as required. The other repricers focus on Amazon and Ebay. Typically, the pricing model is based on the number of products.
Repricing Provider | CSAs | Strategies | Strategy Parameters | Pricing Model
becoding.de | unlimited | Target Position | N/A | #Products
beny-software.de | unlimited | Target Position, Pull-Up | Price Boundary, Delivery Time, Shop Reputation, Delivery Costs Consideration | #Products, #CSAs, Setup
clousale.com | Amazon, Ebay | Target Position | Price Boundary, Gap, Blacklist, Whitelist | #Sales, #Products, Contract Duration
cludes.de/repricing | Amazon | Target Position | Price Boundary or Wholesale Price*x, Delivery Time, Blacklist | #Products
jtl-software.de | Amazon | Buy-Box, Sole Vendor | Price Boundary, Gap | #Sales, #Products
logicsale.de | Amazon, Ebay | Target Position, Time Frame | Price Boundary, Gap, Blacklist | #Products, Contract Duration
patagona.de | unlimited | Target Position, Pull-Up, Time Frame | Price Boundary, Gap, Delivery Costs Consideration, Blacklist, Whitelist, AdjustToNextPricier | #Products, Crawling Interval
preisanalytics.de | unlimited | Target Position | Price Boundary, Gap | #Products, #CSAs
priceparser.de | Amazon | Target Position | Price Boundary, Gap, Shop Reputation, Delivery Costs Consideration, Delivery Time, Blacklist, Whitelist | Software License
repricing.de | Amazon | Target Position | Price Boundary, Gap, Blacklist, Whitelist | #Products, #CSAs
wisergermany.de | Amazon, Ebay | Target Position, Time Frame, Sales | Price Boundary, Gap, Traffic, Sales Speed, Conversion-Rate | #Products, #Shops
Table 5: Repricing providers in Germany.

3.2 Repricing Providers in USA

The repricing providers in the USA put an emphasis on the Buy-Box Strategy for Amazon, as shown in table 6. Whereas repricing providers normally hide strategy details, channelmax.net supplies a public documentation in which more than 60 repricing parameters are specified.12
appeagle.com and ereprice.com provide instant repricing by using the price delta notifications of the Amazon API. darwinpricing.com is based on geographical customer segmentation and learns the price sensitivities of customers via A/B pricing tests. feedvisor.com advertises a rule-independent, fully automated repricing strategy which optimizes profit. Unfortunately, they don't give further details except that machine learning techniques are used which mainly learn from sales data.
3.3 Discussion
There is a broad spectrum of strategies applied in practice. Most strategies are competition-based, and in a few cases machine learning methods are already used. The strategies and underlying parameters go far beyond the strategies developed in the literature (see chapter 2.3). Buy-Box repricing in Germany is not as popular as in the USA. In the US market, dedicated CSAs have emerged, e.g., with a focus on Airbnb or geographical customer segmentation.
Price boundaries are crucial parameters for repricing activities. Missing price boundaries can have a great impact on the calculated prices. Eisen (2011) provides a good example of missing maximum boundaries on Amazon. A book from 1992 about the genetics of flies reached $23,698,655.93 plus $3.99 delivery costs. Two large book resellers were the only market participants:
12 http://www.channelmax.info/wiki/mediawiki-1.15.1/index.php5?title=Talk:RepriceRule (visited on 10/08/2016).
• On the one hand, reseller 'profnath' targeted position one by providing a price of 0.9983 times the current lowest competitor price.
• On the other hand, reseller 'bordeebook' aimed for position two by providing a price of 1.27059 times the current lowest competitor price.
This constellation triggered an upward pricing spiral. Far more important are minimum price boundaries: missing or disregarded minimum prices can cause high losses. The repricing providers appeagle.com and repricerexpress.com were responsible for one-penny listings on Amazon and correspondingly high losses (Steiner 2012; Holland 2014).
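The interplay of the two rules can be replayed in a few lines. The multipliers are those reported by Eisen (2011); the starting price and the function name are hypothetical:

```python
def simulate_spiral(start_price, rounds):
    """Replay the book pricing spiral: in each round 'profnath'
    undercuts the competitor at 0.9983x, after which 'bordeebook'
    re-prices itself at 1.27059x of profnath's new price."""
    profnath = bordeebook = start_price
    for _ in range(rounds):
        profnath = 0.9983 * bordeebook     # slight undercut
        bordeebook = 1.27059 * profnath    # stay above profnath
    return profnath, bordeebook

# Starting from a hypothetical $20, both prices pass the million-dollar
# mark within about 50 rounds: the combined factor per round is
# 0.9983 * 1.27059, roughly 1.268, so prices grow exponentially until a
# price boundary or a human intervenes.
profnath_price, bordeebook_price = simulate_spiral(20.0, 50)
```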
In theory, repricing providers are vulnerable to price wars, so machine learning techniques are needed which account for the future consequences of pricing (Kephart et al. 2000, p. 749). In practice, price boundaries limit downward pricing spirals, as long as they are properly set. Additionally, appropriate countermeasures like the Pull-Up or profit maximization strategy exist for further limitation. The more such interactions are observed on a CSA, the higher the possibility of exposing the underlying pricing strategies.
Repricing Provider | CSAs | Strategies | Strategy Parameters | Pricing Model
appeagle.com | Amazon, Ebay, Rakuten | Buy-Box | Price Boundary, Gap, AdjustToNextPricier, Product Condition, Shop Reputation, Blacklist | #Products, Features
beyondpricing.com | Airbnb | Profit Maximization | Min/Basis Price, Tag, Events, Season, Neighborhood, Demand | #Sales
bqool.com | Amazon | Buy-Box, Profit Maximization, Interlink | Price Boundary, Gap, Blacklist, AdjustToNextPricier | #Products, Features
channelmax.net | Amazon, Rakuten | Interlink, Buy-Box, Pull-Up, Time Frame | Blacklist, Whitelist, Price Boundary, Gap, Delivery Costs, Shop Reputation, Product Condition | #Products, Features
darwinpricing.com | Own Shop | Profit Maximization | Location (based on price indices) | #Sales, Features
ecomengine.com | Amazon | Target Position | Price Boundary, Gap, Delivery Costs, Interval | N/A
ereprice.com | Amazon | Buy-Box, Sole Vendor | Price Boundary, Top N Competitors, Gap | #Products
everbooked.com | Airbnb | Profit Maximization | Price Boundary, Demand, Events, Weekday, Season, Availability | #Sales, Features
feedvisor.com | Amazon | Buy-Box combined with Pull-Up | Shop Reputation, Delivery Time, Delivery Costs, Price Elasticity, Prediction? | #Features, #Sales
get4it.com | Amazon, Ebay, Bigcommerce | Buy-Box | Shop Reputation | #Products
marketyze.com | BestBuy, ReStockIt, sears, Ebay | N/A | Rounding | N/A
repriceit.com | Amazon | Target Position, Time Frame | Price Boundary, Delivery Costs, Product Condition, #Offers, Shop Reputation | #Products
repricerexpress.com | Amazon | Buy-Box, Time Frame | Price Boundary, Gap, Blacklist, Delivery Time, Shop Reputation, Product Condition, AdjustToNextPricier | #Products
solidcommerce.com | Amazon, Ebay, sears, Rakuten, newegg, Overstock.com, etsy | Target Position, Time Frame, Sole Vendor | Price Boundary, Shop Reputation, Product Condition, Costs, Gap, Amazon Costs | N/A
teikametrics.com | Amazon, Ebay | Profit Maximization | Shop Reputation, Blacklist, Delivery Costs | N/A
wiser.com | see wisergermany.de
Table 6: Repricing providers in the USA.
4 Competitive Market Analysis
This chapter examines the dataset consisting of 21.6 million offers which have been crawled from a major German CSA. The focus lies on analyzing price changes in order to determine the degree of pricing dynamics.

It is important to define a price change, which is synonymously called a price delta in the following. Table 7 shows examples of delta calculations. Between n timestamps there can exist only n-1 deltas.
Reseller | t1 | t2 | t3 | Deltas | Possible Deltas | Delta Ratio
A | 1 | 1 | 1 | 0 | 2 | 0.0
B | 1 | 1 | 2 | 1 | 2 | 0.5
C | 1 | 2 | 1 | 2 | 2 | 1.0
Table 7: Examples of deltas and delta ratios.
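The delta and delta ratio definition translates directly into code. This is an illustrative sketch; the price series are the examples from table 7:

```python
def delta_ratio(price_series):
    """Share of realized deltas among the n-1 possible deltas
    between n timestamps (cf. table 7)."""
    if len(price_series) < 2:
        return 0.0
    # Count timestamps at which the price differs from its predecessor.
    deltas = sum(1 for prev, cur in zip(price_series, price_series[1:])
                 if cur != prev)
    return deltas / (len(price_series) - 1)
```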
4.1 Approach and Settings
100 products have been chosen by a dedicated product selection process. The selection builds on a top 40 ranking of the most popular products on the German CSA Billiger.de on 10/25/2015. This ranking is used for filtering popular categories. Subsequently, a category distribution has been retrieved. Twenty popular products reflecting the category distribution have been chosen as reference products. Every product is further extended to a quintuple by enriching it with corresponding products. The composition of a quintuple can be obtained from table 8. The 100 selected products are presented in table 9. The intermediary steps of the product selection can be found in appendix A.
Distinction | Main Dimension
Reference (Base) | -
Variant | Configuration/Appearance
Predecessor | Publication Date
Substitute | Manufacturer
Cheap Substitute | Price
Table 8: The composition of a product quintuple.
The crawling framework of Patagona has been used. It enables offer extraction on a wide range of European CSAs at high-frequency crawling intervals. The crawling framework is further treated as a black box. idealo.de has been chosen as the target CSA since it represents a
Category | Quintuple Id | Product | GTIN | Reference | Variant | Predecessor | Substitute | Cheap Substitute
1 Samsung Galaxy S6 32GB Black Sapphire 8806086676137 �
2 Samsung Galaxy S6 32GB Gold Platinum 8806086936651 �
Q01-G-III 3 Samsung Galaxy S5 16GB Charcoal Black 4250698798406 �
4 Apple iPhone 6 64GB Spacegrau 0888462064101 �
5 Sony Xperia Z3 Compact Black 7311271485889 �
6 Motorola Moto X (2. Generation) 16GB Schwarz 6947681520554 �
7 Motorola Moto X (2. Generation) 32GB Schwarz 6947681521735 �
Q02-G-II 8 Motorola Moto X Walnuß 6947681521148 �
9 LG G3 16GB Schwarz 8806084958235 �
10 Huawei Ascend P7 Schwarz 6901443004836 �
11 Microsoft Lumia 640 schwarz 6438158728189 �
12 Microsoft Lumia 640 orange 6438158724808 �
Q03-G-II 13 Nokia Lumia 635 Schwarz 6438158708068 �
14 Motorola Moto G (2. Generation) 8GB Schwarz 6947681519374 �
15 Samsung Galaxy J1 Schwarz 8806086669122 �
Smartphone 16 Wiko Rainbow Jam 8GB schwarz 4016138998450 �
17 Wiko Rainbow Jam 16GB schwarz 4016138998528 �
Q04-G-II 18 Wiko Rainbow Schwarz 6297000671000 �
19 Motorola Moto E (2. Generation) schwarz 6947681523258 �
20 Sony Xperia E1 Black 4055432001978 �
21 Honor 7 grau 0637825998191 �
22 Honor 7 silber 6901443074020 �
Q05-G-III 23 Honor 6 schwarz 6901443026623 �
24 Huawei P8 Titanium Grey 6901443056705 �
25 LG G Flex 2 16GB Platinum Silver 8806084978172 �
26 Samsung Galaxy S6 Edge+ 32GB Black Sapphire 8806086960687 �
27 Samsung Galaxy S6 Edge+ 64GB Black Sapphire 8806086960601 �
Q06-G-III 28 Samsung Galaxy Note 4 Charcoal Black 8806086371292 �
29 Apple iPhone 6 Plus 16GB Spacegrau 0888462039147 �
30 Motorola Moto X Play 16GB schwarz 6947681527683 �
31 Maxi-Cosi Pebble - Black Raven (2015) 8712930089186 �
32 Maxi-Cosi Pebble - Mosaic Blue (2014) 8712930090366 �
Q07-G-II 33 Maxi-Cosi Pebble - Total Black (2011) 8712930051329 �
34 Cybex Aton Q plus - Storm Cloud 4250183799697 �
35 Römer Baby Safe Plus SHR II Black Thunder 4000984096415 �
36 Gesslein S4 2014 (316000) 4250652384188 �
37 Gesslein S4 2014 (174000) 4250652384096 �
Q08-G-I 38 Gesslein S4 2013 (917000) 4250190167212 �
39 Maclaren BMW 5010902199219 �
Kids 40 Quinny Zapp Red Rumour 8712930081210 �
41 DerDieDas ErgoFlex Panther 4006047405613 �
42 DerDieDas ErgoFlex XL Panther 4006047406610 �
Q09-G-II 43 DerDieDas XLight Candy Castle 4006047404968 �
44 McNeill Ergo Light Plus Caro Softpink 4017245935987 �
45 Scout Buddy Street Soccer 4007953379111 �
46 Milupa Aptamil 2 (800 g) 4008976022336 �
47 Milupa Aptamil 3 (800 g) 4008976022343 �
Q10-V-I 48 Milupa Milumil 2 (800 g) 4008976032878 �
49 Töpfer Lactana Bio 2 (600 g) 4006303122001 �
50 Holle Bio-Folgemilch 2 (600 g) 7640104950394 �
51 Novartis Voltaren Schmerzgel forte 23,2 mg/g (150 g) 08628270 �
52 Novartis Voltaren Schmerzgel forte 23,2 mg/g (100 g) 08628264 �
Q11-V-I 53 Novartis Voltaren Schmerzgel (180 g) 06998784 �
54 Hermes Doc Ibuprofen Schmerzgel (150 g) 4058900010236 �
55 ratiopharm Diclofenac Gel (150 g) 0609788909156 �
56 Stada Grippostad C Kapseln (24 Stk.) 00571748 �
57 Stada Grippostad C Stickpack Granulat (12 Stk.) 09671871 �
Healthcare Q12-V-I 58 Stada Echinacea Classic Tropfen (50 ml) 01309337 �
59 Bayer Aspirin Complex Granulat (10 Stk.) 03227112 �
60 ratiopharm Grippal + C Brausetabletten (20 Stk.) 00999877 �
61 Thierry Mugler Alien Eau de Toilette (30 ml) 3439602810118 �
62 Thierry Mugler Alien Eau de Toilette (60 ml) 3439602810019 �
Q13-V-I 63 Thierry Mugler Womanity Eau pour Elles Eau de Toilette (50 ml) 3439601200118 �
64 Chloé Eau de Toilette (30 ml) 3607340309410 �
65 Jil Sander Eve Eau de Toilette (30 ml) 3607342216754 �
66 Weber Master-Touch GBS 57 cm Black 0077924033025
67 Weber Master-Touch GBS 57 cm Special Edition 0077924032950 �
Q14-G-II 68 Weber One-Touch Original 47 cm Black 0077924003592 �
69 Rösle No.1 Sport F60 Holzkohle-Kugelgrill 4004293250056 �
DIY & Garden 70 Landmann Kugelgrill Black Pearl Comfort (31341) 4000810313419 �
71 Bosch GSR 10,8 V-EC Professional (2 x 2,0 Ah, in L-Boxx) 3165140739108 �
72 Bosch GSR 10,8 V-EC Professional (2 x 2,5 Ah Akkus in L-Boxx) 3165140822114 �
Q15-G-II 73 Bosch GSR 10,8-2-LI Professional 2 x 2,0 Ah + L-Boxx (0 601 868 109) 3165140727495 �
74 DeWalt DCD790D2 (mit 2 x 2,0 Ah Akkus) 5035048410622 �
75 Einhell BT-CD 14,4 2B 4006825538250 �
76 Continental ContiWinterContact TS 830 P 205/55 R16 91H 4019238434033 �
77 Continental ContiWinterContact TS 830 P ContiSeal 205/55 R16 91H 4019238454291 �
Car Q16-G-I 78 Continental ContiWinterContact TS 850 205/55 R16 91H 4019238560688 �
79 Goodyear Ultra Grip 9 205/55 R16 91H 5452000447166 �
80 Nexen Winguard Snow’G 205/55 R16 91H 8807622186608 �
81 Canon EOS 700D Kit 18-55 mm Canon IS STM 3662362017743 �
82 Canon EOS 700D Kit 18-135 mm Canon IS STM 8714574602585 �
Photography Q17-G-III 83 Canon EOS 600D Kit 18-55 mm [Canon DC III] 4960999984094 �
84 Nikon D5300 Kit 18-55 mm Nikon VR II schwarz 0018208935871 �
85 Sony Alpha 58 Kit 18-55 mm 4013675005603 �
86 Philips Senseo Viva Café HD 7825/69 Schwarz 8710103761945 �
87 Philips Senseo Viva Café HD 7825/40 Sizzling Grape 8710103558033 �
Q18-G-I 88 Philips Senseo Original HD 7810/60 schwarz 8710103168836 �
89 Petra KM 42.17 Artenso latte schwarz glänzend 4211129758109 �
Home 90 Petra KM 34.00 4211129851701
91 Apple MacBook Air 13" 2015 (MJVE2D/A) 0888462348164 �
92 Apple MacBook Air 13" 2015 (MJVG2D/A) 4005922018313 �
Computer Q19-G-III 93 Apple MacBook Air 13" 2014 (MD761) 0885909943074 �
94 Asus Zenbook UX305FA-FC159T 4712900139884 �
95 Lenovo IdeaPad U330P (59424883) 0888772347536 �
96 Microsoft Xbox One 500GB 0885370808315 �
97 Microsoft Xbox One 1TB 0885370898279 �
Entertainment Q20-G-III 98 Microsoft Xbox 360 E 500GB 0885370767360 �
99 Sony PlayStation 4 (PS4) 500GB 0711719437017 �
100 Nintendo Wii U Basic Pack 0045496311018 �
Table 9: The 100 selected products.
major German CSA. Offers have been crawled over a period of 80 days, from 11/1/2015 until 1/19/2016. A high-frequency crawling interval of 15 minutes has been applied. In total, the dataset consists of 21,621,484 offers.

The main approach is defined by dividing the dataset into appropriate buckets along multiple dimensions in order to enable fine-grained statements. The market analysis has been performed on a computer with an Intel i7-6700 processor with 4x 3.4 GHz and 32 GB memory. The memory size allows loading and processing the market data on the fly without the need for a database.
4.2 Implementation
Scala13 is primarily used as the underlying programming language. Once the dataset has been loaded from JSON product offer files, a multidimensional market analysis is conducted. The basic concept is shown in figure 3. Initially, the offers are grouped on a high aggregation level.
Figure 3: The market analysis concept. The offers are grouped by a product grouping dimension (1D: all, categories, quintuples, products, resellers) and a leaf dimension (2D: hours, weekdays, yeardays, price, price classes, availability) into offer buckets (1D x 2D). The analyzers (MarketPlaceAnalyzer, PriceAnalyzer, DeltaAnalyzer, MinPriceDeltaAnalyzer, DeliveryCostAnalyzer, ResellerNumberAnalyzer, OfferCountAnalyzer, PriceLeaderChangeAnalyzer, PriceTrendAnalyzer, MinPriceTrendAnalyzer, DeliveryCostDeltaAnalyzer, Top3VendorAnalyzer) are applied to each bucket; the results are written to csv files and plotted with gnuplot templates.
The offers are either ungrouped, grouped on the product level (by product category, quintuple, or individual product), or grouped on the reseller level. Afterwards, the resulting offer segments are further divided by the second dimension:
13 http://www.scala-lang.org
1. Time 2D: This dimension illustrates time dependencies between offers. The offers are grouped by either the hour of day, the day of week, or on a day-of-the-year basis.

2. Price 2D: This dimension separates the products based on their average price. Either the price is continuously mapped or the products are classified into three price classes: Class I (0-100€), Class II (100-300€) and Class III (300-1500€).

3. Availability 2D: This dimension separates the offers into available and out-of-stock offers.
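The two-dimensional bucketing can be sketched as follows. The thesis implementation is in Scala; this Python sketch with hypothetical offer records only mirrors the concept:

```python
from collections import defaultdict

def price_class(avg_price):
    """Map an average price to the three price classes defined above."""
    if avg_price < 100:
        return "Class I"
    if avg_price < 300:
        return "Class II"
    return "Class III"          # 300-1500€

def bucketize(offers, dim1, dim2):
    """Divide offers into (1D x 2D) buckets; dim1 and dim2 are key
    functions, e.g. the product grouping and the leaf dimension."""
    buckets = defaultdict(list)
    for offer in offers:
        buckets[(dim1(offer), dim2(offer))].append(offer)
    return buckets
```

A category-by-hour segmentation would then read `bucketize(offers, lambda o: o["category"], lambda o: o["hour"])`, after which each analyzer is applied per bucket.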
A dozen different analyzers are applied to the segmented offer buckets. The analysis aspects range from simple tasks like counting the offers to more complex tasks like calculating the degree of price leader changes. The analyzers are described in table 10.
Afterwards, the analysis results are written to csv files. Gnuplot14 is used for visualizing the market analysis charts. The charts are plotted based on individual templates for each analyzer. The market analysis and gnuplot scripts are fully parallelized.
4.3 Results
Running the market analysis including plotting takes two hours and produces 123,268 csv files and corresponding plots. An excerpt of the plots is presented in this subchapter, with a focus on price deltas.
4.3.1 1D: All Offers
The dataset contains 21,621,484 offers from 1,589 distinct resellers. The average price of the offers amounts to 283.45€ with average delivery costs of 2.73€. The top three resellers are Amazon (66 products), otto.de (47 products) and jacob-computer.de (44 products). The average delta ratio is 0.31%, which corresponds to a price change every third day per reseller. In general, there are 27% more price cuts than price hikes.

The average minimum price delta is 1.00%, which corresponds to a daily minimum price change rate. This means that three times more repricing activities can be expected on the first position. Again, there are more minimum price cuts than minimum price hikes (16%). In principle, the surplus of price cuts confirms the detected negative price trend with a slope of -0.000494 and a 95% confidence interval of 0.000140.
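The trend figures above come from a linear regression over the price series (cf. the PriceTrendAnalyzer in table 10). A sketch of the slope and confidence computation; using 1.96 times the standard error for the 95% interval is an assumption (normal approximation, adequate for the large n in this dataset):

```python
import math

def trend_with_confidence(prices):
    """Least-squares slope of a price series over time indices 0..n-1,
    with a 95% confidence half-width (1.96 * standard error)."""
    n = len(prices)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(prices) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, prices))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residual_ss = sum((y - (intercept + slope * x)) ** 2
                      for x, y in zip(xs, prices))
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, 1.96 * std_err
```

A perfectly linear falling series yields the exact slope with a confidence half-width of zero; noisy series widen the interval accordingly.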
Surprisingly at first glance, the price leader change ratio amounts to 1.64% and hence is 64% higher than the minimum price delta ratio. The reason can be found in cases where more than one price leader exists. If another reseller matches the first position, the minimum price remains unchanged. However, a price leader change has actually taken place and is detected.
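This distinction can be made concrete with a small sketch over hypothetical market snapshots: in the series below the minimum price never changes, yet one price leader change is detected when a second reseller matches the first position.

```python
def price_leader_changes(snapshots):
    """Count price leader changes across market snapshots
    (reseller -> price). A change is counted whenever the set of
    resellers holding the minimum price differs from the previous
    snapshot, even if the minimum price itself is unchanged."""
    changes = 0
    previous_leaders = None
    for snapshot in snapshots:
        minimum = min(snapshot.values())
        leaders = {r for r, p in snapshot.items() if p == minimum}
        if previous_leaders is not None and leaders != previous_leaders:
            changes += 1
        previous_leaders = leaders
    return changes
```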
Remarkably high is the delivery cost delta ratio of 0.05%, which corresponds to a delivery cost change every 20th day. A striking aspect is the equipartition of delivery cost cuts and hikes, which can later be deduced from time-based delivery cost patterns.
14 http://gnuplot.sourceforge.net
Analyzer | Description
MarketPlaceAnalyzer | Offers on idealo.de are often provided indirectly via other marketplaces. This analyzer groups the offers by their origin. It distinguishes the marketplaces of Amazon, Rakuten, Ebay and Hitmeister. Finally, a marketplace distribution is calculated.
PriceAnalyzer | This analyzer calculates basic price statistics in the form of minimum, maximum, mean, geometric mean, standard deviation and variance.
DeliveryCostAnalyzer | This analyzer calculates basic delivery cost statistics in the form of minimum, maximum, mean, geometric mean, standard deviation and variance.
DeltaAnalyzer | This analyzer calculates the price delta ratio subdivided into the delta directions. If a reseller offers multiple product variants, only the variant with the lowest price is considered. Note: If time dimensions are used, the most recent offers from the previous bucket are considered too.
MinPriceDeltaAnalyzer | This analyzer builds on the DeltaAnalyzer. However, it calculates deltas only on the minimum price series of the products.
DeliveryCostDeltaAnalyzer | This analyzer builds on the DeltaAnalyzer. It focuses on delivery cost deltas.
PriceLeaderChangeAnalyzer | This analyzer builds on the DeltaAnalyzer. However, price leader changes are considered. A price leader change means that the current price leader(s) are not equal to the previous price leader(s).
PriceTrendAnalyzer | In order to assess the overall price trend, this analyzer calculates a linear regression and returns the slope and the 95% confidence interval.
MinPriceTrendAnalyzer | This analyzer builds on the PriceTrendAnalyzer. However, it calculates the trend of the minimum prices.
ResellerNumberAnalyzer | The ResellerNumberAnalyzer counts the distinct number of resellers.
OfferCountAnalyzer | This analyzer returns the plain number of offers.
Top3VendorAnalyzer | This analyzer returns the three vendors which offer the most products of the current offer segment.
Table 10: The different offer analyzers.
The offer origin can be obtained from figure 4; 16% of the offers originate from marketplaces.
Figure 4: The offer origin on idealo.de (Non-Marketplace 84%, Amazon 6%, Ebay 5%, Rakuten 3%, Hitmeister 2%).
The hourly delta analysis reveals significant differences between daytime and nighttime, as shown in figure 6(a). The lowest price delta ratio is reached at 4 am (UTC) with 0.14%. At 8 am (UTC) the price delta ratio peaks at 0.56%. However, those differences have not been confirmed for the minimum price delta ratio.

The delta analysis by weekdays is shown in figure 6(b), where one corresponds to Monday. A clear difference between workdays and weekends is evident. The average price delta ratio is 0.35% for workdays and 0.23% for weekends. Further analyses of price leader change ratios, minimum price delta ratios and delivery cost delta ratios come to the same conclusion.

The price trend analysis in figure 5 reveals that the downward price trend mostly peaks on Monday and Tuesday, whereas Friday tends towards an increasing price trend. However, the confidence intervals are wide, which can be explained by large differences between the categories.
Figure 5: Analysis of price trends on a day-of-the-week basis (trend value with 95% confidence interval; weekdays 1-7).
Figure 6(c) shows the average price delta ratios on a daily basis. Day one represents 11/1/2015. The workday/weekend differences are once again underlined; days 7/8, 14/15, 21/22 etc. are weekends. The price delta ratio reaches its maximum on November 24th (a Tuesday) at 0.83%. In fact, the whole week until November 27th (Friday) forms a top plateau. This can be explained by Amazon's Black Friday week, which took place in this period. The period between Christmas and the beginning of 2016 is characterized by low repricing activities (day 54 until day 64).
The day-based reseller number analysis reveals that the number of resellers decreases between the beginning and the end of a month. For example, in November the number of distinct resellers drops from 591 to 560. Some resellers may have a limited monthly budget for their listings on CSAs.
The offers can be separated into 15,753,939 available and 5,867,545 out-of-stock offers. An offer is available if the delivery time is equal to or less than two days. The price delta ratio of available products (0.29%) is lower than that of the out-of-stock counterpart (0.35%). The same applies to the availability difference for the minimum price delta ratios (0.97% vs. 1.38%). The minimum price trend amounts to -0.006568 (0.000362 confidence) for unavailable and -0.000913 (0.000166 confidence) for available products. A possible explanation for the stronger downward minimum price trend could be a compensation for the longer waiting time.
4.3.2 1D: Product Categories
A category-based analysis of different deltas is presented in figure 7. Regarding the price delta change ratios, the car category clearly exhibits the highest ratio with 1.96%. Regarding the minimum price delta change ratios, the car category still exhibits the highest ratio. However, the healthcare category holds the second highest ratio with 1.74%, although its plain delta ratio only reaches 0.17%. This difference by a factor of ten leads to the conclusion that a limited number of resellers apply high-frequency repricing. The car category exhibits an average price leader change ratio of 5.01% and peaks on November 24th with 10.79%.
The electronic categories show a strong downward minimum price trend, as shown in table 11.
Category Price Trend 95% Confidence
Entertainment -0.003617 0.000255
Computer -0.003073 0.000584
Smartphone -0.001848 0.000180
Table 11: Minimum price trends of selected categories.
Figure 6: Delta analysis of all offers by different time horizons (delta down and delta up ratios): (a) by hours of day (UTC), (b) by weekdays, (c) by days starting from 11/1/2015.
4.3.3 1D: Products
A car tire15 holds the highest average price delta ratio of 2.51%. Furthermore, this product reaches a price leader change ratio peak of 47.50% on 1/11/2016. A school backpack16 has the overall lowest average price delta ratio of 0.01%.

In the first four days of the dataset, the overall minimum price delta is very high. This can be traced back to products from the healthcare category, e.g., GTINs 8628264, 571748, 3227112 and 4058900010236. This behavior is exemplarily shown in figure 8(a) for the product with GTIN 8628264 (Novartis Voltaren Schmerzgel forte 23,2 mg/g (100 g)). The resellers apolux.de and apotheke-online.de make high-frequency jumps between predefined boundaries.
15 GTIN 4019238434033
16 GTIN 4017245935987
Figure 7: Product categories under consideration of different deltas (delta down ratio and delta up ratio). (a) Delta analysis by product categories. (b) Minimum price delta analysis by product categories.
Figure 8: Product with GTIN 8628264 with two high frequency repricing resellers. (a) Daily deltas of a single product. (b) Excerpt of a price series (in Euro) with the alternating price leaders apotheke-online.de and apolux.de, 11/1 to 11/5.
Another car tire17 exhibits the highest average minimum price delta rate of 7.86%, which
corresponds to a minimum price change roughly every two hours.
4.3.4 1D: Resellers
This subsection provides three examples of resellers and their repricing activities.
Amazon shows a price change rate of 0.62%, which is more than twice the resellers' average.
The price hike and price cut ratios are nearly identical. The highest price change rate,
2.51%, is achieved on 12/15/2015. The highest minimum price change rate, 5.30%, is
reached on 12/27/2015.
17 GTIN 4019238454291
Mindfactory is a good example for explaining the high rate of delivery cost deltas. Mindfactory
offers 'Midnight-Shopping'18. In order to have full order books in the morning, Mindfactory
drops the delivery costs from midnight to 6 am for a major part of its assortment. Accordingly,
the detected average delivery cost delta at midnight (CET) is 31.90%. Overall, Mindfactory
reveals an average delivery cost delta of 2.24%. The delivery cost hikes and cuts are equally
distributed.
good-tires [ebay.de] is the reseller with the highest average price delta ratio.19 It accounts
for 55.21%, and the price hikes and cuts are equally distributed as well.
4.4 Discussion
This chapter focused on price deltas since they express pricing strategy actions on the operational
level. Only if enough price changes come into the picture is drawing conclusions about
price patterns and pricing strategies viable. This analysis recognizes an almost seven times
higher price change rate compared to Bounie et al. (2012, p. 1). High frequency market
dynamics have been detected which may be sufficient for deriving pricing strategies and
forecasting prices. Repricing activity is especially notable at the first price position. Events like
Christmas or Amazon's Black Friday can express themselves in alternating price deltas. However,
price change behavior differs greatly between product categories. Time-dependent repricing
activities have been discovered, such as differences between workday/weekend and day/night.
That may be an indication of automated repricing since, e.g. during the night, manual repricing
activities should be at a minimum, whereas automated repricers remain active.
18 http://www.mindfactory.de/info_center.php/icID/16
19 For a reseller with at least 1,000 offers.
5 Analysis
This chapter’s primary objective is the derivation of pricing insights based on reseller price
histories of a CSA. The foundation of all three tasks is a preprocessing step which calculates
features for the price series. Breaking down the objective, three main tasks have been conducted
like shown in figure 9:
Figure 9: Analysis overview. A shared feature set extraction computes price delta, price, position and price gap features; these feed the automated repricing classification via supervised classification (chapter 5.1), the price prediction via price delta transformation (chapter 5.2) and the pricing strategy extraction via serial extraction (chapter 5.3).
1. Task: Supervised classification of reseller price series regarding automated repricing
2. Task: Prediction of minimum and reseller price deltas
3. Task: Pricing strategy extraction with six strategy-dependent extractors arranged in a
serial extraction process
The dataset presented in the previous chapter 4 serves as base data. The analysis has
been conducted on a computer with an Intel i7-6700 processor (4x 3.4 GHz) and 32 GB of
memory. Regarding the implementation, Scala is used as the underlying programming language.
All analysis steps are fully parallelized. Weka20 in version 3.8 has been used for the decision tree
approaches. Each of the following subchapters has a concept section which explains the fundamental
concepts used in the approaches.
20 http://www.cs.waikato.ac.nz/ml/weka
5.1 Automated Repricing Classification
The main goal of this subsection is to ascertain the possibility of detecting the origin of a
price series on a CSA. This task distinguishes between manual and automated repricing. Only
if this problem is solvable are more in-depth pricing insights conceivable. In summary, the approach
consists of supervised learning. The historic price series have been classified by repricing
domain experts. Subsequently, decision tree algorithms are applied and validated in a comprehensive
evaluation. A transition step is performed in order to deduce conclusions for the real world
application of the presented approaches.
5.1.1 Concepts
Supervised classification learning is a learning scheme which uses a set of classified examples
(training set) in order to classify unseen examples21 (testing set). This process is called supervised
since it is provided with the training set including the actual outcome in the form of a class.
The testing set is used for applying metrics for determining the success rate (Witten and Frank
2005, pp. 42-43).
Decision tree approaches are a common approach to supervised classification learning. Attributes
from the training set are used to establish a tree structure. The tree nodes contain rules for
attribute examinations and the edges represent the actual decisions. The leaf nodes determine
the classes (Witten and Frank 2005, p. 62). A simple decision tree example is shown in figure
10.
time until due date of master thesis
  < 30 days   -> write thesis
  >= 30 days  -> weather
                   good -> cycling
                   bad  -> write thesis
Figure 10: A simple decision tree example.
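The toy tree from figure 10 can be sketched as nested conditionals, where each inner node tests an attribute and each leaf returns a class. This is an illustrative Scala sketch, not thesis code; the names `Activity` and `decide` are invented for the example.

```scala
// Illustrative sketch of the decision tree in figure 10 (names are hypothetical).
sealed trait Activity
case object WriteThesis extends Activity
case object Cycling extends Activity

// Inner nodes test attributes; leaves return the class.
def decide(daysUntilDue: Int, weatherGood: Boolean): Activity =
  if (daysUntilDue < 30) WriteThesis   // edge: < 30 days
  else if (weatherGood) Cycling        // edge: >= 30 days, good weather
  else WriteThesis                     // edge: >= 30 days, bad weather
```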
C4.5 is a decision tree approach which is based on a greedy divide-and-conquer concept. At-
tributes are added top down during tree construction according to their information gain22.
21 Examples are further called instances. Every instance is determined by values of predefined attributes.
22 Information gain corresponds to an increase in the average purity of the subsets.
C4.5 uses two pruning concepts for preventing overfitting23. On the one hand, prepruning is
used during the decision tree growing process. A parameter for setting the minimum num-
ber of instances per leaf node can be specified. Additionally, C4.5 only allows pruning for
nodes with at least two successors. On the other hand, a confidence-based postpruning is used.
Since the actual error for the testing set is not known in advance, such error has to be es-
timated by confidence intervals on each node. The confidence intervals and a parametrized
confidence level determine the pruning decision. Furthermore, C4.5 supports missing values,
discrete and continuous attributes and weighting of attributes (Quinlan 1993; Witten and Frank
2005, pp. 187-199).
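The confidence-based postpruning estimates an upper bound on the node error from the observed error rate. A sketch of this estimate, following the formulation in Witten and Frank (2005); the function name is illustrative and z = 0.69 approximates C4.5's default 25% confidence level:

```scala
// Sketch of a C4.5-style pessimistic error estimate (after Witten and Frank 2005).
// f = observed error rate at a node, n = number of instances at the node,
// z = normal quantile for the confidence level (assumption: z ~ 0.69 for CF = 25%).
def pessimisticErrorRate(f: Double, n: Double, z: Double = 0.69): Double = {
  val z2 = z * z
  (f + z2 / (2 * n) + z * math.sqrt(f / n - f * f / n + z2 / (4 * n * n))) / (1 + z2 / n)
}
```

The estimate is always larger than the observed error rate, which is what makes the pruning decision pessimistic.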
A REP tree is a decision tree approach which also takes advantage of splitting by information
gain. It uses reduced error pruning. This postpruning technique estimates the expected testing
set error by holding back a training subset. However, the actual tree is then built on less training
data. REP trees are optimized for fast execution since numeric attributes are sorted only once.
They support prepruning with minimum instances and the missing values treatment of C4.5
(Witten and Frank 2005, pp. 193-194, 407-408).
Random forest is a multiple decision tree approach which builds a predefined number of randomized
decision trees. The basic algorithm is presented in algorithm 1. The forest
remains unpruned. Unknown instances are classified by a majority vote over all built trees
(Breiman 2001; Witten and Frank 2005, pp. 320-321).
Algorithm 1 Random forest algorithm.
for a predefined number of trees do
    bootstrap <- select a random training subset
    while growing a tree with bootstrap do
        for every node do
            select a random subset of attributes
            split the node by e.g. information gain
        end for
    end while
end for
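The majority vote over the built trees can be sketched in a few lines; this is an illustrative Scala sketch, not the Weka implementation:

```scala
// Sketch: classify an unknown instance by a majority vote over the per-tree predictions.
def majorityVote[A](treePredictions: Seq[A]): A =
  treePredictions.groupBy(identity).maxBy(_._2.size)._1
```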
Cross validation is an accuracy estimation method. The dataset is randomized and split into
k subsets (folds) of approximately the same size. During k iterations each subset is used
exactly once as the testing set, whereas the k-1 other subsets are used as training data. An example
of a five-fold cross validation can be found in figure 11. Ten-fold cross validation has become
the standard accuracy estimation method (Kohavi 1995; Witten and Frank 2005, pp. 149-151).
Cross validation can prevent overfitting (Hsu et al. 2003, p. 5).
23 Overfitting describes the problem that occurs if a decision tree's complexity is too high. In this case, the decision tree may be well suited for the training set, but it is overfitted due to exactly matching the dynamics of the training set.
Figure 11: 5-fold cross validation partitioning scheme.
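The partitioning into k folds can be sketched on index level; a Scala sketch with an invented function name, not thesis code:

```scala
// Sketch of k-fold partitioning: shuffle the indices, cut them into k groups of
// roughly equal size, and use each group exactly once as the testing fold.
def kFoldIndices(n: Int, k: Int, seed: Long = 42L): Seq[(Seq[Int], Seq[Int])] = {
  val rnd = new scala.util.Random(seed)
  val shuffled = rnd.shuffle((0 until n).toList)
  val folds: Seq[List[Int]] = shuffled.grouped(math.ceil(n.toDouble / k).toInt).toSeq
  folds.indices.map { i =>
    val testing = folds(i)
    val training = folds.indices.filter(_ != i).flatMap(folds).toList
    (training, testing) // (training indices, testing indices) for fold i
  }
}
```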
Grid search is a naive approach for finding a good classifier configuration. A classifier is trained
with different predetermined configurations, and the configuration with the highest prediction
accuracy is chosen (Hsu et al. 2003, pp. 5-8).
Feature selection approaches try to remove irrelevant and redundant attributes24. Sophis-
ticated wrapper approaches have been developed for the purpose of feature subset selection.
These approaches consider the underlying classifier as black box and measure the classifier per-
formance during feature selection (Kohavi and John 1997; Guyon and Elisseeff 2003).
Auto balancing describes the process of establishing a class balance within datasets. A dataset
is imbalanced if its classes are not equally distributed. Imbalanced datasets have major impacts
on the classifier and prediction accuracy. Classifiers tend to strongly optimize their
models for the majority class. At first glance, this leads to good prediction results. However, in
such cases the prediction accuracy for the minority class is often very low.
Therefore, countermeasures have been developed. A simple solution is to establish class
balance by:
• Reweighting the classes
• Reducing the number of instances of the majority class (undersampling)
• Increasing the number of instances of the minority class by injecting duplicates (oversam-
pling)
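The naive oversampling variant from the list above can be sketched as follows; an illustrative Scala sketch with a deterministic duplication order, not thesis code:

```scala
// Sketch of naive oversampling: repeat minority instances until the class
// reaches a target size (duplicates are injected in cyclic order here).
def oversample[A](minority: Seq[A], targetSize: Int): Seq[A] =
  Iterator.continually(minority).flatten.take(targetSize).toSeq
```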
Chawla et al. (2002) developed SMOTE25 which is a more sophisticated method to over-
come imbalanced datasets. SMOTE generates synthetic instances of the minority class by using
a predefined number of nearest neighbours. Additionally, undersampling of the majority class
is recommended.
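The core step of SMOTE, interpolating between a minority instance and one of its nearest neighbours, can be sketched as follows. This is an illustrative Scala sketch of the interpolation only; neighbour search and the random choice of `gap` from [0,1] (as in Chawla et al. 2002) are omitted:

```scala
// Sketch of SMOTE's core idea: a synthetic minority instance lies on the line
// segment between an instance x and one of its nearest neighbours.
def smoteSynthetic(x: Vector[Double], neighbour: Vector[Double], gap: Double): Vector[Double] =
  x.zip(neighbour).map { case (a, b) => a + gap * (b - a) }
```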
In order to measure prediction accuracy various metrics have been developed (Witten and
Frank 2005, pp. 168-173):
24 Features are a synonym for attributes.25 SMOTE stands for Synthetic Minority Over-sampling Technique.
5 Analysis 42
• ROC stands for receiver operating characteristic and describes the trade-off between the true
positive rate (y-axis) and the false positive rate (x-axis): The higher the area under the ROC
curve, the better the model. This metric captures how well the classifier can separate
the determined classes.
• Precision (number of relevant documents retrieved / total number of documents retrieved) is a metric for accuracy.
• Recall (number of relevant documents retrieved / total number of relevant documents) is a metric for completeness.
• F-measure (2 * recall * precision / (recall + precision)) combines precision and recall as their harmonic mean.
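The listed metrics follow directly from the true/false positive and negative counts; a minimal Scala sketch:

```scala
// Precision, recall and F-measure from confusion-matrix counts:
// tp = true positives, fp = false positives, fn = false negatives.
def precision(tp: Int, fp: Int): Double = tp.toDouble / (tp + fp)
def recall(tp: Int, fn: Int): Double = tp.toDouble / (tp + fn)
def fMeasure(p: Double, r: Double): Double = 2 * p * r / (p + r)
```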
5.1.2 Approach
Supervised classification learning techniques are used. Repricing domain experts have classified
the 7,300 price series on product level as either manual repricing or automated repricing.
Classification criteria were, for example: very frequent price changes, fast price changes in order
to hold a position, price change patterns, and price changes at conspicuous times. Two main
classification schemes of automated repricers are further distinguished:
• Pure: This class contains exactly the experts' classification.
• Injected: The injected class incorporates the identified automated repricers per product
from the pure class. Additionally, if a reseller is classified as an automated repricer for at
least two products, its whole assortment is tagged as having automated repricing origin. The
basic idea behind this concept is that it is unreasonable for a reseller to apply automated
repricing to only one product.
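The injected scheme can be sketched on (reseller, product) pairs; an illustrative Scala sketch with invented names, not thesis code:

```scala
// Sketch of the 'injected' scheme: a reseller tagged as an automated repricer
// for at least two products gets its whole assortment tagged as automated.
def injectedScheme(pureAuto: Set[(String, String)],
                   assortment: Set[(String, String)]): Set[(String, String)] = {
  val autoResellers = pureAuto.groupBy(_._1).collect {
    case (reseller, pairs) if pairs.map(_._2).size >= 2 => reseller
  }.toSet
  pureAuto ++ assortment.filter { case (reseller, _) => autoResellers(reseller) }
}
```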
For each reseller price series a wide range of features has been calculated. The main domains are
delta, gap, position and price features. Features are tagged as meta features if competition is
incorporated. The offers are prefiltered by dropping resellers which could not be determined
during the offer crawling process. Further, offers with inexplicable prices (above 10,000€ or
below 0.01€) are filtered. Table 12 shows the forty features used.
The features which are preferred by the decision tree classifiers are marked in the 'importance'
column, where three stars mean most important. Unsurprisingly, the price delta
features are the most selected features. However, other features are important too, such as the most
frequent cent ending: a manually repricing reseller may more often have prices ending
on 99 cents, whereas automated repricing resellers may have other, irregular cent endings. A high
distinct price ratio can only be reached if a lot of price changes have taken place. Availability is
selected as an important feature, which confirms the observation from chapter 4.3.1 that price
delta ratios differ regarding availability.
The evaluation scheme is presented in figure 12. The main goal of the evaluation is an
optimized classifier setup in order to provide good prediction results. The training set is auto
balanced by SMOTE since only 5% of the resellers are classified as automated repricers. If
Category | Nr | Name | Description | Range | Meta | Importance
DELTA | 1 | avgDelta | Degree of price changes in reseller price series | [0..1] | | ***
DELTA | 2 | avgDeltaToProduct | Ratio of avgDelta to average degree of price changes of product | [0..max] | M |
DELTA | 3 | avgDeltaToMinPriceProduct | Ratio of avgDelta to degree of min price changes of product | [0..max] | M |
DELTA | 4 | avgTop3ShortestChangeRatio | Average of the top 3 shortest price change intervals in milliseconds | [x..max] | | ***
DELTA | 5 | deltaDownRatio | Ratio of reseller's price decreases to all possible price changes | [0..1] | | **
DELTA | 6 | deltaUpRatio | Ratio of reseller's price increases to all possible price changes | [0..1] | |
DELTA | 7 | downUpDeltaRatio | Ratio of deltaDownRatio and deltaUpRatio | [0..max] | | **
DELTA | 8 | longestPlateau | Longest period in days with no price changes | [0..80] | |
DELTA | 9 | mainDeltaTime | Most frequent hour of reseller's price changes | [0..23] | |
DELTA | 10 | maxDeltaDayRatio | Highest reseller delta ratio achieved on a single day | [0..1] | | **
DELTA | 11 | nightDeltaRatio | Ratio of how many reseller deltas are made between 23h and 7h | [0..1] | |
GAP | 12 | avgGapToMinPrice | Relative gap of the reseller price series to the product min price | [1..max] | M |
GAP | 13 | avgHigherGap | Average absolute gap to next higher offer | [0..max] | M |
GAP | 14 | avgLowerGap | Average absolute gap to next lower offer | [0..max] | M |
GAP | 15 | avgRelativeHigherGap | Average gap to next higher offer compared to product min price | [0..1] | M |
GAP | 16 | avgRelativeLowerGap | Average gap to next lower offer compared to product min price | [0..1] | M |
GAP | 17 | mainAbsoluteHigherGap | Most frequent absolute gap to next higher offer | [0..max] | M |
GAP | 18 | mainAbsoluteLowerGap | Most frequent absolute gap to next lower offer | [0..max] | M |
POSITION | 19 | avgPos | Average position exclusive delivery costs | [1..max] | M |
POSITION | 20 | avgPosWithDelivery | Average position inclusive delivery costs | [1..max] | M |
POSITION | 21 | degreeInTop3 | Reseller's degree in top 3 without delivery costs | [0..1] | M | *
POSITION | 22 | degreeInTop3WithDelivery | Reseller's degree in top 3 with delivery costs | [0..1] | M |
POSITION | 23 | degreeInTop10 | Reseller's degree in top 10 without delivery costs | [0..1] | M |
POSITION | 24 | degreeInTop10WithDelivery | Reseller's degree in top 10 with delivery costs | [0..1] | M |
POSITION | 25 | endogenousChangeRatio | Ratio of endogenous position changes | [0..1] | M |
POSITION | 26 | exogenousChangeRatio | Ratio of exogenous position changes | [0..1] | M |
POSITION | 27 | maxPosition | Maximum position of the reseller without delivery costs | [1..max] | M |
POSITION | 28 | minPosition | Minimum position of the reseller without delivery costs | [1..max] | M |
POSITION | 29 | positionSpan | Difference between reseller's max and min position without delivery costs | [0..n] | M |
PRICE | 30 | avgPriceToProduct | Ratio of average price to average product price | [0..max] | M |
PRICE | 31 | distinctPriceRatio | Number of distinct prices | [1..max] | | *
PRICE | 32 | priceSegments | Number of coherent price segments in reseller price series | [1..max] | |
PRICE | 33 | priceTrend | Reseller's price trend by linear regression | [min..max] | |
PRICE | 34 | priceTrendComparison | Ratio of reseller's price trend to product price trend | [min..max] | M |
PRICE | 35 | relativeMedianSpan | Relative price span between reseller's median price and min price | [0..1] | |
PRICE | 36 | mostFrequentCentEnding | Most often used cent amount by reseller | [0..0.99] | | *
PRICE | 37 | relativePriceSpan | Relative price span between reseller's max and min price | [0..1] | |
REST | 38 | availability | Degree of offer availability for delivery | [0..1] | | *
REST | 39 | numberOfResellers | Number of resellers which are selling this product | [1..max] | M |
REST | 40 | offerRatio | Ratio of average number of reseller offers to average number of product offers | [0..max] | M | ***
Table 12: Overview of classification features (Meta: M if competition is incorporated).
Figure 12: The evaluation scheme of the automated repricing classification. The 21.6M offers are filtered (1.8M dropped), yielding 7.3K price series; preprocessing calculates the 40 features and the experts classify the series, from which the pure and injected datasets are derived. Inside the feature wrapper approach, the greedy and binary feature selectors are combined with a 5-fold cross validation and a grid search to find the best classifier configuration, producing optimized feature sets per evaluation context. The actual evaluation then uses auto balancing, a 10-fold cross validation and a grid search over the classification schemes (pure, injected) and classifiers (C4.5, REP tree, random forest); the training set is used to build the model and the testing set to predict. Applied metric: area under the ROC curve.
no balancing scheme were used, prediction models would be built which nearly always predict
manual repricing. Appendix B covers the impacts of different balancing mechanisms on the
achieved metrics.
In order to get stable evaluation results a ten-fold cross validation is conducted. Three
different decision tree classifiers have been used: C4.5, REP tree and random forest. Pruning
has been activated for C4.5 and REP trees.
Two feature selectors have been developed. The greedy feature selector adds features
iteratively; in each iteration the feature with the highest positive metric gain is selected.
The binary feature selector iterates over all features and builds two classes: with and without
the current feature. Random samples are generated for each class. If the average metric of
the class incorporating the current feature is higher than that of the class without it,
the current feature is added. Further details of the feature selection schemes can be found in
appendix B.
A naive grid search has been implemented in order to ensure an optimized configuration
of the used classifiers. This brute force approach is needed to ensure comparability between
cross validation folds. The grid search parameters can be obtained from appendix C. Feature selection
and grid search are combined in a deeply nested way to consider interdependencies.
Before the actual evaluation is triggered, the sophisticated feature wrapper approach is
applied with the intention of filtering an optimized feature set. Afterwards, the actual evaluation
can be set up with the previously optimized features for different scenarios with the classification
schemes and classifiers. The applied metric is the area under the ROC curve since it represents
the general classification performance incorporating both classes.
5.1.3 Evaluation
The experts have classified 383 price series as artificial, which amounts to 5.25%. The automated
repricing ratio per category is shown in figure 13. The car category is
characterized by a high degree of automated repricing of 21.68%. Accordingly, a car tyre26 scored
the highest automated repricing ratio with 41.98%.
Figure 15(a) presents the classification results with the pure classification scheme. The results
are averaged over the ten cross validation folds. The full evaluation took 43 hours.
Random forest performs best and achieves, with assistance of the binary random feature selector,
an average predictive ROC area of 97.11% (testing ROC area was 98.59%). Hereby, the F-measure
accounts for 88.94%, whereas the manual repricing part achieves 89.49% and the
automated repricing part reaches 87.33%.
The C4.5 classifier attains up to 95.94% predictive ROC area via the binary feature selector
(testing ROC area was 97.28%). This classifier scores the highest F-measure with 89.97%, which
is one percent more than random forest. An example C4.5 tree with the pure classification scheme
and binary feature selection is pictured in figure 14. The leaves contain the number of classified
and misclassified instances. The strong correlation of price deltas to automated repricing
classification becomes apparent. Examples of larger C4.5 trees can be viewed in appendix F.
26 GTIN 4019238434033
Figure 13: Automated repricing ratio of categories.
The fast REP tree places third with up to 95.05% predictive ROC area.
If we look at the injected results in figure 15(b), the predictive ROC area drops significantly,
by approximately 6% across all classifiers. The classifier ordering remains unchanged. Consequently,
random forest scores best with up to 91.48% predictive ROC area (testing ROC area
was 93.08%). An F-measure of up to 82.14% has been realized. In general, the prediction results
are still at a high level. Detailed results including all class-dependent metrics, tree sizes and
preferred features are shown in appendix E.
Across all scenarios, the binary sampling feature selector performs better, with an average
ROC area of 92.67% compared to 91.99% for greedy. The binary feature selector selects on average
thirteen features, whereas the greedy feature selector chooses four. In summary, the binary
feature selector calculated three times more trees.
So, an interesting question arises:
Are the promising classification results transferable to practice?
In practice, a high frequency crawling interval of 15 minutes is expensive. Furthermore, price
series information covering 80 days may not be available.
Therefore, the crawling interval and the time range are synthetically reduced for the
subsequent analysis. The tests were conducted with the C4.5 classifier configured with the
binary feature selector using the pure classification scheme under a 10-fold cross validation.
Figure 16(a) shows a slow metric decrease as the crawling interval is reduced down to a twice-daily
interval. The ROC area falls from 95.94% (15 min interval) to 94.03% (720 min interval).
This interval incorporates the preferred features: degreeInTop10WithDelivery, deltaDownRatio
Figure 14: A generated C4.5 tree. The root splits on maxDeltaDayRatio; further splits use features such as deltaUpRatio, avgDelta, deltaDownRatio, avgTop3ShortestChangeRatio, degreeInTop10, longestPlateau, maxPosition and mostFrequentCentEnding.
Figure 15: Classification prediction results (ROC area by feature selection mechanism, binary vs. greedy, for REP tree, C4.5 and random forest). (a) Classification prediction results with the pure classification scheme. (b) Classification prediction results with the injected classification scheme.
and avgTop3ShortestChangeRatio. Ultimately, at a daily crawling interval a predictive ROC area
of 90.89% is reached. At this stage, the overall F-measure amounts to 81.69% and the F-measure
for the automated repricing class (AR) stands at 78.31%. Although the number of
underlying offers is reduced by almost a factor of 100, the achieved metrics are still at a high level.
Figure 16(b) continues with the synthetic daily crawling interval. Now the time range is
reduced. Covering only twenty offers, an average predictive ROC area of 89.89% is still
reached. Nonetheless, the F-measure of the automated repricing class drops to 74.81%, and for a
time range of ten days it further decreases to 66.19%.
5.1.4 Discussion
In summary, the random forest classifier and binary random feature selector perform best at au-
tomated repricing detection. The feature selectors significantly reduce the number of features
Figure 16: Transition of the classification prediction results from theory to practice (ROC area, F-measure and F-measure(AR)). (a) Sampling rate curve [minutes]. (b) Time range curve [days].
under minimal prediction accuracy loss. Auto balancing can establish class equipartitioning.
Promising classifier results have been achieved for the high frequency crawling interval. A
transition to practice is possible, since even a time range of only 20 daily observations yields
good results. The author recommends a time range of at least 20 days with a twice-daily crawling interval.
However, the ground truth is not known. The results are based on supervised classification by
domain experts, so the actual automated repricing distribution may look different.
An interesting subsequent question is whether classifier models can be created that
handle different sampling intervals and time ranges all at once. The features could enable this
since they are designed as relative values.
In pursuance of unmasking automated repricing resellers, a single price series may be sufficient
to draw conclusions about the whole assortment. This information could be used in
practice as additional information by repricing providers. Blacklists or dedicated strategies
for handling the uncovered resellers are conceivable. This chapter has shown that automated
repricing classification is feasible. Hence, the foundations for prediction and strategy extraction
are laid.
5.2 Price Prediction
The price prediction abstracts from the underlying pricing strategies. The hypothesis is that accurate
forecasts can be made without knowledge of the strategies. A sophisticated combination of decision and
regression trees has been developed. The approach is compared to a broad spectrum of other
non-feature-based predictors. The divide-and-conquer concept is consistently pursued by consideration
of:
• Price delta series instead of plain price series
• Different price delta types
• Multiple stage solutions
An exhaustive evaluation has been conducted by varying prediction intervals, target prediction
series and approach configurations.
5.2.1 Concepts
Hyndman and Athanasopoulos (2014, pp. 46-51) describe a special treatment regarding cross
validation during the evaluation of time series. Randomization can't be applied since it
deconstructs the underlying time dependency. Time series cross validation needs to consider
prior observations as the training set. In order to produce reliable forecasts, a minimum number of
observations k is needed. The basic process is described here:
1. Select the observation at time k+i for the testing set and rely on the observations at time
points 1, 2, ..., k+i-1 to build the prediction model. Compute the forecast error for time k+i.
2. Repeat the above step for i = 1..n, where n is the number of desired folds.
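The steps above can be sketched as an expanding-window fold generator; an illustrative Scala sketch on index level with invented names, not thesis code:

```scala
// Sketch of time series cross validation folds: fold i trains on the first
// k+i-1 observations (indices 0..k+i-2) and tests on the observation at index k+i-1.
def timeSeriesFolds(k: Int, numFolds: Int): Seq[(Range, Int)] =
  (1 to numFolds).map(i => (0 until (k + i - 1), k + i - 1))
```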
An example of a five-fold time series cross validation is shown in figure 17.
Figure 17: 5-fold time series cross validation partitioning scheme.
Regression trees differ from normal decision trees by using continuous values at the leafs
(Quinlan 1992, p. 343). A simple regression tree example is shown in figure 18. The values in
the leafs are averages of the incorporating instances. Regression tree algorithms minimize the
instance variations instead of using information gain as splitting criterion (Witten and Frank
2005, pp. 243-244).
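The variance-based splitting criterion can be sketched as follows; an illustrative Scala sketch of one common formulation (size-weighted variance reduction), not the exact Weka implementation:

```scala
// Sketch of the regression tree splitting criterion: choose the split that
// maximally reduces the size-weighted variance of the target values.
def variance(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  xs.map(x => (x - mean) * (x - mean)).sum / xs.size
}

def varianceReduction(parent: Seq[Double], left: Seq[Double], right: Seq[Double]): Double = {
  val n = parent.size.toDouble
  variance(parent) - (left.size / n) * variance(left) - (right.size / n) * variance(right)
}
```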
bike condition
  poor -> 0 km cycling
  top  -> season
            winter -> motivation
                        normal   -> 30 km cycling
                        superior -> 60 km cycling
            summer -> tour de france stage live on tv
                        no  -> 80 km cycling
                        yes -> 0 km cycling
Figure 18: A simple regression tree example.
M5 trees were developed by Quinlan (1992). They are refined regression trees: instead of numerical
values, multivariate linear models are placed at the leaves. Further, a reduced error
pruning technique is applied, as well as smoothing for compensating severe discontinuities
between neighbouring linear models. Wang and Witten (1997) enhanced M5 trees by a number
of additions: a compensation factor is added to the linear models, a minimum instance
number at the leaves is introduced and a treatment for missing values is implemented.
Support vector regression is a non-linear technique that tries to fit a flat function to a training
set. A tolerance parameter determines a boundary at which deviations are considered
relevant (via a loss function). First, the training set is mapped into a feature space.
A dot product calculation is performed by the underlying kernel function and weights are added.
Overfitting is prevented by slack variables for constraint relaxation (Smola and Schölkopf 2004).
Support vector regression is used, for example, in financial market prediction and electric utility
forecasting (Sapankevych and Sankar 2009, p. 26).
In order to evaluate numeric forecasts, prediction metrics have to be applied which comprise
the magnitude of prediction error. The forecast error of the ith observation can be described
as the difference between actual and predicted value, e_i = y_i − ŷ_i. The two most commonly
used metrics are further presented (Hyndman and Athanasopoulos 2014, pp. 46-51; Witten
and Frank 2005, pp. 176-179):
Mean absolute error: MAE = average(|e_i|)
Root mean squared error: RMSE = sqrt(average(e_i^2))
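Both metrics are straightforward to compute; a minimal sketch in Python (the example values are invented):

```python
import math

def mae(actual, predicted):
    # Mean absolute error: average magnitude of the forecast errors.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root mean squared error: penalizes large deviations more strongly.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [100.0, 102.0, 101.0, 105.0]
predicted = [100.0, 100.0, 103.0, 101.0]
# errors: 0, 2, -2, 4
```

The squared term in the RMSE is what makes a single error of 4 weigh more than two errors of 2, which is why the two metrics can rank predictors differently.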
The metrics differ in their scales. RMSE penalizes larger deviations more strongly.
This thesis uses the forecast package27 of the statistical software R28. This package provides a
wide range of statistical methods for predicting univariate time series which are briefly recapit-
ulated in table 13.
The Kalman filter estimates the state of a process by minimizing the mean squared error.
A predetermined set of equations is recursively used. Predictions are based on current state
and external variables plus their probability distributions and dependencies (Welch and Bishop
2006).
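The recursive predict/update cycle can be illustrated with a minimal scalar filter (a strong simplification of the general state-space formulation; the noise parameters below are arbitrary example values, not taken from the thesis):

```python
def kalman_1d(measurements, process_var=1e-5, measurement_var=0.1):
    """Minimal scalar Kalman filter: recursively fuses the current estimate
    with each new measurement, weighted by the Kalman gain."""
    estimate, error = measurements[0], 1.0
    estimates = [estimate]
    for z in measurements[1:]:
        # Predict: state assumed constant, uncertainty grows by process noise.
        error += process_var
        # Update: the gain balances estimate uncertainty against measurement noise.
        gain = error / (error + measurement_var)
        estimate += gain * (z - estimate)
        error *= (1 - gain)
        estimates.append(estimate)
    return estimates

smoothed = kalman_1d([10.0, 10.4, 9.8, 10.2, 10.1])
```

Each estimate is a convex combination of the previous estimate and the new measurement, so the output stays within the range of the observed values while damping the noise.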
5.2.2 Approach
The first step of the price prediction approaches is a price series simplification. The series are
converted to a price delta series. Such a normalization process has the advantage of having zero
mean and the same scaling (Arias et al. 2013, p. 8:13).
Besides, three price delta types are distinguished:
1. Simple Delta: This type reduces the price delta to the essentials: a boolean
value indicating whether a price delta takes place or not.
2. Direction Delta: This delta type additionally considers the delta direction. Three
values are possible: positive delta (price hike), no delta and neg-
ative delta (price cut).
3. Absolute Delta: This delta reflects the actual numerical price delta value.
This trichotomy splits the problem into more tractable prediction sub-tasks.
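The three delta types can be derived from a price series in a few lines. A sketch using integer cent prices (the example data and function names are invented):

```python
def to_deltas(prices):
    # Absolute delta: numeric difference between consecutive prices (in cents).
    return [b - a for a, b in zip(prices, prices[1:])]

def simple_delta(deltas):
    # Boolean: did a price change take place at all?
    return [d != 0 for d in deltas]

def direction_delta(deltas):
    # +1 price hike, 0 no change, -1 price cut.
    return [(d > 0) - (d < 0) for d in deltas]

prices = [1999, 1999, 1849, 1849, 1999]
absolute = to_deltas(prices)
```

Each representation discards progressively less information, which is exactly what makes the simple delta the easiest and the absolute delta the hardest sub-task.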
Primarily, two types of approaches can be separated at high level. On the one hand, there are
time series-based approaches which consider only the delta series. This sort of predictor has
the capability of predicting the price delta for any point in time. The reason for this is that their
skeletons are based on time-dependent functions. On the other hand, there are feature-based
approaches which calculate features on the basis of the price series. An approach overview is
given in figure 19.
The no delta predictor always forecasts no price change. The evaluation will later show that
such a 'no change predictor' achieves good results in stable environments. This predictor can
be seen as a reference because the repricing providers implicitly rely on the assumption of stable
prices during their price calculations.
27 https://cran.r-project.org/web/packages/forecast/index.html
28 https://www.r-project.org
Autoregressive Integrated Moving Average (ARIMA): Autoregressive means regression against itself plus allowance of a randomized variable (white noise). Moving average models use past forecast errors instead of past values for prediction. ARIMA combines both concepts (Hyndman and Athanasopoulos 2014, pp. 223-230; Hyndman and Khandakar 2008, pp. 8-12).

Exponential Smoothing (ETS): The core element of this method is a weighting of points in time: the most recent time points receive higher weights (Hyndman; Koehler, et al. 2002; Hyndman and Athanasopoulos 2014, pp. 171-212).

Box-Cox transform, ARMA errors, Trend, and Seasonal components (BATS): This technique considers the features incorporated in its acronym. The Box-Cox transformation tries to optimize the regression models by logarithmically transforming the target values; the objective is to probe different transformation parameters and to select the best transformation. BATS is a multi-seasonal model which relies on exponential smoothing (Livera et al. 2011).

Trigonometric BATS (TBATS): TBATS is an extended version of BATS with better consideration of non-integer seasonality (Livera et al. 2011).

Holt Winters (HW): This technique is a simple version of exponential smoothing with only one seasonal component. It tries to decompose the time series into a seasonal, slope and level component (Hyndman and Athanasopoulos 2014, pp. 188-194).

Double-Seasonal Holt Winters (DSHW): DSHW is an extended version of HW which considers two seasonal components (Taylor 2003).

Seasonal and Trend decomposition using Loess (STL): STL supports any type of seasonality. In addition, this method provides outlier robustness and support for changing seasonal components. Time series are decomposed into seasonal, trend and irregular components (Hyndman and Athanasopoulos 2014, pp. 163-167).

Neural Network Autoregression (NNETAR): Artificial neural networks allow complex non-linear relationships between the response variables and their predictors by reconstructing simplified nerve structures. NNETAR is a feed-forward neural network with hidden (intermediate) layers: each layer of nodes receives inputs from the previous layer until the output (prediction) layer is reached. The predictors are weighted by a learning algorithm that reduces a cost function like the RMSE (Hyndman and Athanasopoulos 2014, pp. 276-280).

Table 13: Overview of time series prediction methods of R's forecast package.
[Diagram: the offer concentrator feeds two parameters (type of delta to predict, sample interval) into the predictor classes: time series-based predictors (simple, R, Weka) that predict the next delta, a feature-based decision/regression tree predictor that predicts the period delta, a hybrid Weka overlay predictor combining time series and features, and a dummy no delta predictor as reference. The R predictor offers the models ARIMA, BATS, ETS, TBATS, STL, HW, DSHW and NNETAR; the decision/regression tree predictor combines a random forest for the direction delta with an M5 tree for the absolute delta, plus grid search, balancing (none, SMOTE, weight-based) and pretraining.]
Figure 19: The price delta prediction concept.
The simple predictor represents a heuristic which calculates the average price delta interval.
As soon as a newly begun time interval is detected during the prediction process, the most fre-
quent price delta is predicted.
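A minimal sketch of this heuristic (the function names and data are invented for illustration; the thesis implementation is not shown):

```python
from collections import Counter

def train_simple_predictor(deltas):
    """Learn the average spacing between nonzero deltas and the most
    frequent nonzero delta value."""
    change_points = [i for i, d in enumerate(deltas) if d != 0]
    gaps = [b - a for a, b in zip(change_points, change_points[1:])]
    avg_interval = sum(gaps) / len(gaps) if gaps else None
    most_common = Counter(d for d in deltas if d != 0).most_common(1)[0][0]
    return avg_interval, most_common

def predict_simple(t, last_change, avg_interval, most_common):
    # Predict the most frequent delta once a new interval has begun, else no change.
    return most_common if t - last_change >= avg_interval else 0

avg, common = train_simple_predictor([0, -50, 0, 0, -50, 0, 0, -50])
```

The predictor therefore only encodes "how often" and "by how much, typically", which is why it serves as a cheap baseline rather than a serious competitor.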
The R predictor uses the forecast package of R. R is controlled by the rscala29 plugin which
enables R execution within Scala. In order to cope with missing values in the dataset, a Kalman
filter with ARIMA approximation has been used to impute the missing values.30 The previously
presented eight time series prediction methods from table 13 can be used as predictors (namely
ARIMA, ETS, BATS, TBATS, HW, DSHW, STL, NNETAR). In general, the forecast package of R
has the benefit of auto-configuration by automatically analyzing the given time series. Since
the Holt Winters methods (HW, DSHW) cannot handle zero or negative values, the corresponding
delta series has been shifted to a positive series. Afterwards, the prediction is shifted
back to match the original delta series.
The weka predictor is based on a forecasting plugin from Pentaho31 which relies on Weka. The
basic concept comprises a time point deconstruction by removing the temporal order. Hereby,
the time points are split up into 'lagged variables'; up to 24 lags are allowed. The implementation
used here considers time dynamics of: a) hour of day, b) morning/afternoon, c) working
day/weekend and d) day of week. The actual prediction is based on the lagged variables and
an underlying base predictor. Linear regression, support vector regression (SVR) and a multilayer
perceptron (MLP) can be selected. Besides, this approach supports balancing by weights
in order to handle imbalanced datasets.
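The lag construction can be illustrated as follows (a simplified stand-in for the plugin's behaviour, not its actual code; the function name is invented):

```python
def lagged_matrix(series, max_lag):
    """Turn a time series into rows of lagged variables plus the target,
    removing the explicit temporal order: each row carries its own history."""
    rows = []
    for t in range(max_lag, len(series)):
        features = series[t - max_lag:t]   # lag_max ... lag_1
        rows.append((features, series[t])) # (lagged inputs, target)
    return rows

rows = lagged_matrix([1, 2, 3, 4, 5, 6], max_lag=3)
```

Once the series is in this tabular form, any standard regression learner (linear regression, SVR, MLP) can be trained on it without knowing it came from a time series.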
The weka overlay predictor supports the same characteristics as the normal weka predictor
plus the consolidation of overlay data. Overlay data, also known as 'intervention variables',
are time-specific features incorporated into the model. The features of the decision/regression
tree predictor approach are used.
The decision/regression tree predictor is based on features. The features are separated into
current and historic features. The fifteen current features are only valid for a certain point in
time and are kept as implicit as possible. They are shown in table 14. The introduction of this
new class of features has the main goal of deriving meta rules that consider both classes of
features, e.g.
IF currentHour==9 AND mainDeltaTime==9 AND positionLost THEN predict delta.
The decision/regression tree predictor simplifies the prediction task by using two stages:
1. Prediction of the direction delta with a random forest
29 https://cran.r-project.org/web/packages/rscala/index.html
30 The R package 'imputeTS' has been used for this purpose: https://cran.r-project.org/web/packages/imputeTS/index.html
31 http://wiki.pentaho.com/display/DATAMINING/Time+Series+Analysis+and+Forecasting+with+Weka
Nr  Name                                    Description                                                       Range   Meta
1   currentDay                              Current week day [UTC]                                            [1..7]
2   currentHour                             Current hour of day [UTC]                                         [0..23]
3   currentPosition                         Current position without delivery costs                           [1..n]  ✓
4   currentPositionWithDelivery             Current position considering delivery costs                       [1..n]  ✓
5   currentLowerGap                         Current gap to the next lower offer                               [0..n]  ✓
6   currentHigherGap                        Current gap to the next higher offer                              [0..n]  ✓
7   currentAvailability                     Availability status of the offer                                  [0,1]
8   hoursSinceLastDelta                     Time in hours since the reseller changed his price                [0..n]  ✓
9   hoursSinceLastPositionLost              Time in hours since the reseller deteriorated his price rank      [0..n]  ✓
10  hoursSinceLastPositionGained            Time in hours since the reseller improved his price rank          [0..n]  ✓
11  hoursSinceLastEndogenousPositionChange  Time in hours since the reseller changed his position caused by an own price change  [0..n]  ✓
12  hoursSinceLastExogenousPositionChange   Time in hours since the reseller changed his position caused by competitors          [0..n]  ✓
13  hoursSinceLastAvailable                 Time in hours since the reseller's product was last available     [0..n]  ✓
14  aloneOnPrice                            Are competitors in the same price class?                          [0,1]   ✓
15  currentResellers                        The number of current resellers                                   [1..n]  ✓
Table 14: Overview of prediction features.
2. If an actual price delta is predicted, the actual deflection is determined with an M5 regres-
sion tree. Hereby, only training data of the currently predicted delta class are used.
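The two-stage idea can be illustrated with trivial stand-ins: a per-feature majority vote replaces the random forest, and a per-class average replaces the M5 tree. This is purely illustrative; the actual predictor uses the Weka learners named above.

```python
from collections import defaultdict

def train_two_stage(features, deltas):
    """Stage 1: majority direction per feature value (stand-in for the random forest).
    Stage 2: average magnitude per direction class (stand-in for the M5 tree)."""
    direction_votes = defaultdict(lambda: defaultdict(int))
    magnitude = defaultdict(list)
    for f, d in zip(features, deltas):
        direction = (d > 0) - (d < 0)
        direction_votes[f][direction] += 1
        if direction != 0:
            magnitude[direction].append(d)

    def predict(f):
        votes = direction_votes.get(f)
        if not votes:
            return 0
        direction = max(votes, key=votes.get)
        if direction == 0 or not magnitude[direction]:
            return 0
        # Stage 2 is trained only on instances of the predicted delta class.
        values = magnitude[direction]
        return sum(values) / len(values)

    return predict

predict = train_two_stage(["9h", "9h", "12h", "12h"], [-100, -200, 0, 0])
```

The key design point survives the simplification: the magnitude model never sees "no change" instances, so it is not biased towards zero.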
This approach is enriched with the following capabilities:
Grid search: The classifier configurations are optimized with grid search. The corresponding parameters can be obtained from appendix G.

Auto balancing: Auto balancing handles imbalanced datasets.

Assortment prediction: This capability considers alternative averaged reseller delta features regarding the reseller's assortment. The concerned features can be found in chapter 5.1.2.

Pretraining: The predictor can be assigned further training instances, e.g. all assortment instances or all minimum price instances.

Market simulation: The price deltas are predicted for a new time frame. The features are recalculated based on the prediction, and iteratively the next price deltas can be forecasted. In this way the market can be simulated. However, with each forecast the prediction error increases.
[Diagram: offers enter a 20-fold time series cross validation over the minimum price and reseller price series; for every accumulated interval, current features (current time, current position, time since last delta, ...) and historic features (avg delta, main delta time, degree in top 3, ...) are calculated and Weka instances are created; (1) the delta class for the next period is predicted with a grid-searched random forest (positive delta, no delta, negative delta); (2) the exact delta is predicted with grid-searched M5 regression trees using only instances of the specific delta class; MAE and RMSE are applied as metrics.]
Figure 20: The evaluation scheme of the decision/regression tree price predictor.
In a preprocessing step, the offers are synthetically concentrated. Thereby, sampling rates of
24, 12 and 6 hours are chosen as concentration intervals. These intervals reflect the prediction
window. A further prerequisite is the selection of the desired delta prediction type. Besides,
the decision/regression tree predictor is equipped with either the pretraining, assortment or
normal prediction method, plus a balancing scheme. Afterwards, a nested in-depth evaluation is
applied. The simplified evaluation concept is shown exemplarily for the decision/regression
tree predictor in figure 20:
1. A time series cross validation with up to 80 folds has been conducted. In every fold,
only currently offering resellers are considered. For example, a 20-fold time series cross
validation with a daily crawling interval of 80 days leads to a first fold which comprises
the first 60 days (corresponding to 60 features and price series points) in order to predict
the price of the 61st day. The cross validation approach used in this thesis keeps the training
(60 days) and testing set (20 days) sizes stable. The different simulated crawling intervals of
24/12/6 hours correspond to a 20-/40-/80-fold time series cross validation.
2. Different product aggregation schemes are applied, namely all products and the car prod-
uct category. The car category has been chosen due to its high degree of pricing dynamics
(chapter 4.3.2).
3. Dedicated price series are analyzed: either a synthetic minimum price series or reseller
price series. The prediction of the minimum price series is a simplification of the prediction
problem: more price deltas occur at the first position, and the minimum price series
generally has a high economic impact.
4. The additional features for the current timestamp are calculated.
5. The presented two-stage prediction is applied with configurations optimized by grid
search.
6. The error measures MAE and RMSE are calculated.
A full market simulation over multiple periods is not part of the evaluation.
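The fold construction from step 1 can be sketched as follows, assuming a fixed-size training window that slides forward in time (an illustration of the scheme, not the thesis implementation):

```python
def ts_cross_validation_folds(n_points, n_folds):
    """Yield (train_indices, test_index) pairs with a fixed-size training
    window sliding forward in time (no shuffling, temporal order intact)."""
    train_size = n_points - n_folds
    for k in range(n_folds):
        yield list(range(k, k + train_size)), k + train_size

# 80 daily observations, 20 folds: train on 60 days, predict the next day.
folds = list(ts_cross_validation_folds(80, 20))
```

Unlike ordinary k-fold cross validation, no fold ever trains on data that lies after its test point, which is essential for honest time series evaluation.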
5.2.3 Evaluation
The evaluation’s main goal is the analysis of the developed decision/regression tree approach
(further called decision tree predictor) for predicting price deltas. Excerpts of a sophisticated
analysis as described in the previous chapter are presented, namely:
1. The prediction of minimum prices (100 products) under consideration of different crawl-
ing intervals and delta types
2. The full prediction of reseller prices of a dedicated product category (5 products) under
consideration of different delta types and a daily crawling interval
RMSE has been chosen as target metric due to its harsher punishment of larger devia-
tions. The RMSE values are all averaged. Deviations within the simple delta are counted as
one. Deviations within the direction delta are counted as the gap between the following encodings:
priceHike=1, noDelta=0 and priceCut=-1. Deviations within the absolute delta are
considered as such. The evaluation has been conducted with a crawling start time at 8 am
(UTC). Appendix H shows the independence of the starting hour by calculation of the RMSE
stability. A starting hour at 8 am (UTC) has been chosen since at this hour the price delta ratio
peaks (see chapter 4.3.1).
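With the encoding above, the direction delta RMSE can be computed directly (the example series are invented):

```python
import math

# Direction deltas encoded as: price hike = 1, no delta = 0, price cut = -1.
def direction_rmse(actual, predicted):
    errors = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Predicting a cut (-1) when a hike (1) occurs counts as a deviation of 2:
score = direction_rmse([1, 0, -1, 0], [-1, 0, -1, 1])
```

The encoding makes confusing a hike with a cut twice as costly as missing a change altogether, which matches the economic intuition behind the metric.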
Forecasting Minimum Price Series of all Products
The first prediction task includes forecasting all minimum prices. Synthetically created crawling intervals of 24, 12 and 6 hours have been selected. The broad spectrum of prediction approaches has been applied, covering all delta types. Eight prediction approaches are visualized, selected by the predictors' performance (in terms of lowest RMSE):
• The no delta predictor which can be seen as reference
• The simple delta predictor
• The two best configured decision tree predictors
• The two best R predictors
• The two best weka predictors
The complete prediction results for the daily crawling interval are unveiled in appendix I.
An offer concentration with daily intervals increases the average minimum price delta to 25%.
Figure 21 presents the prediction results for the simple delta. The pretrained decision tree predic-
tor performs best in all crawling intervals. The reference no delta predictor has been surpassed
by 5.32% (1440 minutes), 5.83% (720 minutes) and 1.40% (360 minutes). The pretrained
decision tree predictor performs slightly better than its simplified version without pretraining.
The R predictors are only within range at the daily interval with a RMSE of 0.4988. The other
predictors reduce errors with a higher crawling rate whereas HW reaches a RMSE of 0.8482
at the 360 minutes interval. The weka predictors with support vector regression (SVR) strongly
reduce their prediction errors with increasing crawling rate. At the highest crawling rate the
weka predictor with SVR achieves a RMSE of 0.2744 which is the third lowest error measure.
The no delta predictor exhibits an error measure of 0.2741 whereas the pretrained decision tree
predictor can take the lead with 0.2702 RMSE.
An interesting point is the observation of price persistence ratios. Price persistence ratios
(PPR) are defined as PPR = 1 − delta ratio. The no delta predictor has per definition a
PPR of one. The PPR of the pretrained decision tree predictor increases from 93% (24 hours) to
97% (12 hours) and finally reaches 99% (6 hours). The PPR of the weka predictor with SVR
massively increases (43% ⇒ 93% ⇒ 98% across the crawling intervals), which explains the corresponding reduced errors.
Figure 21: Minimum price prediction results for the simple delta.
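Computing a predictor's PPR from its forecast deltas is a one-liner (the example data are invented):

```python
def price_persistence_ratio(predicted_deltas):
    """Share of predictions that forecast no price change (PPR = 1 - delta ratio)."""
    no_change = sum(1 for d in predicted_deltas if d == 0)
    return no_change / len(predicted_deltas)

# 2 of 10 predictions forecast a change:
ppr = price_persistence_ratio([0, 0, -50, 0, 0, 0, 120, 0, 0, 0])
```

A PPR close to one means the predictor behaves almost like the no delta reference, which is why the metric helps interpret the RMSE gaps above.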
Figure 22 presents the direction price delta prediction results. In general, it shows a similar
picture to the previous delta type. The pretrained decision tree predictor takes the lead again.
However, the gaps to the no delta predictor become smaller (2.54% ⇒ 1.78% ⇒ 0.10%
across the crawling intervals). Again, the two best R predictors are far behind and the
weka predictors' PPR is strongly positively correlated with the crawling interval.
In the task of predicting the most difficult delta, the absolute delta, a convergence towards
the no delta predictor can be observed at a crawling interval of 12 hours. The results are shown
in figure 23. The pretrained decision tree predictor exhibits a RMSE of 6.62 (no delta predictor:
6.85 RMSE). The evaluation has only partly been run for the 360 minutes interval due to time
constraints. Surprisingly, the ETS predictor has a RMSE which is only 0.47% worse than the no
delta predictor even though it has a PPR of 28%. Initially, the weka overlay predictor has a lower
RMSE than its counterpart without overlay (7.04 versus 11.14 at the daily crawling interval).
Across all scenarios the simple predictor has an RMSE which is approximately 10% higher
compared to the no delta predictor. The MLP approaches cannot come close to the leading ones
although they achieved promising results in pre-tests on time series pattern recognition. The
concept of the weka overlay predictor looked promising through its consideration of time series,
derived time patterns and features as overlay data. Nonetheless, in this evaluation it could not
realize its potential. In not a single case could a decision tree predictor with a balancing scheme
reach the top two decision tree predictors. Therefore, auto balancing is not applicable for
this kind of prediction problem. The lower the sampling interval, the lower the probability of a
price change. So, the approaches are more likely to adapt to the no delta predictor.
Figure 22: Minimum price prediction results for the direction delta.
Figure 23: Minimum price prediction results for the absolute delta.
Figure 24 shows an example M5 tree of the pretrained decision tree predictor, illustrating the incorporation of both feature types for minimum price prediction.
[Tree diagram: splits on relativePriceSpan, availability, currentHigherGap, deltaDownRatio, priceTrend, currentDay, mainAbsoluteHigherGap, offerRatio and nightDeltaRatio, leading to twelve linear models LM 1-LM 12.]
Figure 24: A grown M5 tree for minimum price prediction.
Forecasting Reseller Price Series of the Car Product Category
The second prediction task consists of predicting all reseller price series of the car product
category. This category has been chosen because it exposed the highest category delta ratio
in the market analysis (chapter 4.3.2). Hereafter, result excerpts are presented comprising
the prediction results with a daily crawling interval and a comparison between the no delta
predictor and different decision tree approaches. The daily interval has been chosen due to its
high pricing dynamics. In this excerpt, the approaches are reduced to the no delta predictor as
reference and differently configured decision tree predictors. The approach selection has been made
since both selected predictor types clearly outperform the other approaches. The decision tree
approach is always configured without balancing. Further, the decision tree approach is either
setup with assortment prediction, pretraining or as plain predictor.
The car product category is denoted as a high-frequency repricing category. After the offer
interval is reduced to a daily one, the category exhibits an average daily delta ratio of 35% and
comprises 89 distinct resellers.
Figure 25 presents the prediction results clustered by delta type. Instead of using abso-
lute averaged RMSEs, a new normalized RMSE is introduced. It is normalized with respect to the
reference no delta predictor. A lower normalized RMSE means better performance in comparison to the
reference predictor. A comparison within the group of decision tree predictors revealed that the
assortment approach cannot provide any advantages over the plain decision tree predictor. The
pretrained approach shows serious advantages over the other two approaches when using the
direction and absolute delta. The pretrained decision tree predictor outperforms the plain de-
cision tree predictor by 45% in the absolute delta scenario. However, when using the
simple delta, the pretrained predictor is slightly worse (1.74%). The direction delta scenario is
the only prediction case in which the pretrained decision tree predictor cannot take the lead over
the reference predictor. On the other hand, it takes the lead again by 11.01% over the
reference predictor for absolute deltas. Complete information on the underlying prediction
values is offered by appendix J.
Figure 25: Reseller price delta prediction results of the car category.
The PPRs of the decision tree predictors can be obtained from table 15. Noticeable is the
high PPR of the pretrained decision tree predictor.
An example M5 tree of the pretrained decision tree predictor in figure 26 shows the incorporation
of both feature types for reseller price prediction.

Delta Type       | Decision Tree Predictor | Assortment Decision Tree Predictor | Pretrained Decision Tree Predictor
Simple Delta     | 68.63%                  | 69.31%                             | 87.01%
Direction Delta  | 83.68%                  | 83.68%                             | 98.03%
Absolute Delta   | 75.73%                  | 83.68%                             | 90.92%
Table 15: Price persistence ratios of the decision tree approaches (prediction for the car category with all resellers).

[Tree diagram: splits on degreeInTop10WithDelivery, hoursSinceLastEndogenousPositionChange, relativeMedianSpan, currentLowerGap, avgPriceToProduct, hoursSinceLastExogenousPositionChange, degreeInTop10 and avgLowerGap, leading to nine linear models LM 1-LM 9.]
Figure 26: A grown M5 tree for reseller price prediction.

5.2.4 Discussion

Different decision tree predictors were the subject of a thorough evaluation. They were compared to
a broad spectrum of other elaborate approaches. The pretrained decision tree predictor achieves
the most promising prediction results. In the vast majority of scenarios this predictor is at least
as good as the reference no delta predictor. Thereby, up to 11% error reduction is realized. The
other approaches cannot catch up. Most of them are based on plain price series approaches (R
predictors, simple predictor, weka predictors). That alone may not be sufficient for adapting to
the artificial piece-wise price series on CSAs.
The pretrained decision tree approach shows that a focus on a single reseller price series
is not enough. The consideration of the whole assortment via pretraining leads to major pre-
diction improvements. This is shown exemplarily by the reseller's absolute delta prediction
in the car category, in which the error metric could be improved by 45%. Compared to the
small step of assortment consideration (from one to a maximum of five considered reseller
price series), the pretraining is highly effective. Building on larger datasets should reveal more
prediction-relevant connections like extended reseller assortment data and in general more
training examples which should lead to further error reduction.
In contrast, the usage of assortment features showed no advantages. Auto balancing
schemes have negative prediction implications, increasing the used error metrics. The de-
tachment of the temporal order may deliver a partial explanation.
Prediction over small time frames is not very promising in practice because the pricing en-
vironment is too stable in such cases. Larger prediction time ranges increase the probability of
a price change and therefore make prediction more reasonable. A one-day-ahead prediction
would be a good starting point. Since the minimum price series itself
is very meaningful and easier to predict, the author suggests predicting the corresponding absolute
deltas. This information can be wrapped in a price tendency feature which repricing providers
can integrate in their frontend.
Essentially, the presented prediction approaches abstract from strategies. This has the
major advantage of predicting all reseller prices without knowing the blueprints (strategies).
However, all underlying strategies are somehow considered in terms of implicit mappings via
'hidden layers'. Knowing pricing strategies is very valuable, but knowing all strategies in order
to make full market predictions is an enormous challenge.
5.3 Pricing Strategy Extraction
The pricing strategy extraction relies on a heuristic filtering process with individual filters for
six strategies. The extracted strategies are manually analyzed on a sample basis. The extraction
pipeline works with a real dataset instead of synthetic strategies or self-defined models.
Problem-specific methods are applied without need for a training set. Since the task of strategy
extraction is a complex one, the main goal of this analysis is not the full derivation of all pricing
strategies. It is rather seen as proof of concept.
5.3.1 Concepts
Granger causality (Granger 1969) is a statistical test for time series. If past data of time series
X improves the prediction quality of another time series Y, then X Granger-causes Y. Granger
causality is based on transforming Granger's key idea into a regression model which is solv-
able with a hypothesis test. Shibuya et al. (2009) apply Granger causality to numerical
and symbolic time series. The authors achieve good results for predicting stock closing prices.
They outperform vector autoregression models for small datasets (small in terms of fewer than
500 samples).
A motif is a previously unknown pattern in a time series (Tanaka et al. 2005, p. 269). Motif
discovery describes the mining of motifs (Lin; Keogh, et al. 2002, p. 1). Discretization transforms
continuous time series data into a discrete equivalent. In motif discovery, discretization is regularly
done by a transformation into a symbolic representation (Lin; Keogh, et al. 2002, pp. 3-4).
A piecewise aggregate function is often used for that purpose (Minnen et al. 2007, p. 3). A
sliding window method is applied to account for temporal delays (Tanaka et al. 2005,
pp. 279-281). Random projection can be used for locating approximately equal motifs (Minnen
et al. 2007, pp. 3-4). Tanaka et al. (2005, pp. 279-281) present a motif discovery scheme:
the most frequently appearing patterns are extracted based on a symbolic representation and
a sliding window method. The original time series data is mapped back so that the distances of the
mined pattern classes can be calculated. Finally, thresholds filter the discovered motifs. Minnen
et al. (2007) present an approach for discovering recurring patterns in multivariate time series
data.
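The symbolic discretization and window-based motif counting can be sketched as follows (a heavily simplified stand-in for SAX-based motif discovery, without normalization, random projection or distance thresholds; symbols and data are invented):

```python
from collections import Counter

def discretize(deltas):
    # Symbolic representation of a direction delta series: u(p), n(one), d(own).
    return "".join("u" if d > 0 else "d" if d < 0 else "n" for d in deltas)

def count_motifs(symbols, window):
    # Slide a window over the symbolic series and count recurring subwords.
    return Counter(symbols[i:i + window] for i in range(len(symbols) - window + 1))

symbols = discretize([0, 5, -5, 0, 5, -5, 0, 5, -5])
motifs = count_motifs(symbols, window=3)
```

Recurring subwords with high counts are motif candidates; a real implementation would then map them back to the numeric series and filter by distance thresholds, as described above.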
5.3.2 Approach
The basic idea behind the strategy extraction concept is a serial extraction process with strategy-
specific methods. Once a strategy is identified, it is not put back into the extraction pipeline.
Therefore, the extractor order is relevant. The extraction pipeline can be seen as a handcrafted
decision tree. The strategy extraction process is shown in figure 27. Besides pricing strategy
types, the extractors are intended to derive the underlying strategy parameters.
A reseller price delta ratio higher than 0.3% is a precondition for the automated
repricing strategies (time frame, pull-up, target position and interlink). This limit is aligned with
the average delta ratio discovered in the market analysis (chapter 4.3.1). Dedicated strategy
extractors have been implemented, which are described in the following:
[Pipeline diagram: price series and features are passed through the extractors in order: static, hit & run, time frame, pull-up, target position, interlink; anything left is labeled unknown strategy.]
Figure 27: The pricing strategy extraction pipeline.
1. Static Strategy Extractor
Firstly, the static strategy extractor is applied. A static strategy is characterized by no price
changes. As long as no price changes are detected the strategy is identified as static.
2. Hit and Run Strategy Extractor
A hit and run strategy is characterized by low prices and short offering periods, and has already been
detected in CSAs (Haynes and Thompson 2008a, p. 19; Haynes and Thompson 2008b, p. 467).
The corresponding extractor analyzes all price segments (periods of offering) of the reseller’s
price series. If all price segments have a maximum duration of three days and the reseller price
is below the average price, the strategy is identified as hit and run strategy. The number of
price segments, the average offering time in hours and the average position are returned as
underlying strategy parameters.
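The two conditions and the returned parameters can be sketched directly; the segment representation and function names are assumptions for illustration.

```python
# Hedged sketch of the hit-and-run check. A "segment" is one continuous
# offering period; its dict layout is an assumption.

def is_hit_and_run(segments, reseller_avg_price, market_avg_price,
                   max_segment_hours=72):
    # All offering periods must last at most three days (72 hours) and
    # the reseller must price below the market average.
    short = all(s["duration_hours"] <= max_segment_hours for s in segments)
    cheap = reseller_avg_price < market_avg_price
    return short and cheap

def hit_and_run_parameters(segments, positions):
    # Underlying strategy parameters as named in the text.
    return {
        "segment_count": len(segments),
        "avg_offering_hours": sum(s["duration_hours"] for s in segments) / len(segments),
        "avg_position": sum(positions) / len(positions),
    }
```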
3. Time Frame Strategy ExtractorThe time frame strategy extractor is two-parted:
1. Initially a motif discovery extractor is applied. It uses the jMotif SAX-VSM32 library which
has been introduced by Senin and Malinchik (2013). The extractor is set up with window
sizes for discovering motifs in the course of days and weeks, as well as a delta alphabet.
The window sizes comprise a minimum number of three weekly occurrences and five
daily appearances. Beforehand, the offers are concentrated to an hourly basis in order to
smooth the price series. The price series is simplified to a direction delta series. Only valid
motifs are accepted. Valid means that all three delta direction characteristics occur or two
of them with each: Two or more occurrences. At least three/five repetitions are required
for weekly/daily motifs. The discovered motif, the number of appearances, the first and
last appearance as well as the identified window size are derived as underlying strategy
parameters.
32 https://github.com/jMotif/sax-vsm_classic
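The simplification to a direction delta series and the motif validity rule can be sketched as follows; the function names are assumptions, and the actual motif discovery is left to the jMotif library.

```python
from collections import Counter

# Sketch of the direction delta simplification and the motif validity rule.

def to_direction_deltas(prices):
    # 'u' = price went up, 'd' = down, 's' = stayed the same.
    out = []
    for prev, cur in zip(prices, prices[1:]):
        out.append("u" if cur > prev else "d" if cur < prev else "s")
    return "".join(out)

def is_valid_motif(motif):
    # Valid: all three delta characters occur, or exactly two of them
    # occur with at least two occurrences each.
    counts = Counter(motif)
    if len(counts) == 3:
        return True
    return len(counts) == 2 and all(c >= 2 for c in counts.values())
```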
2. The second extractor is a simplified heuristic which identifies whether two major daily price change times exist. This can be seen as a minimum criterion for further automated time-based strategies. The two change times should each account for more than 25% of all deltas and lie at least two hours apart. Due to its generic strategy description, this extractor is applied last. The two major price change times are returned as underlying strategy parameters.
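The heuristic above reduces to finding the two most frequent change hours and checking the 25% share and two-hour gap conditions; this is a sketch under those stated thresholds, with assumed names.

```python
from collections import Counter

# Sketch of the two-major-change-times heuristic: the two most frequent
# daily price change hours must each cover more than 25% of all deltas
# and lie at least two hours apart (on a wrap-around 24h clock).

def two_major_change_times(change_hours, min_share=0.25, min_gap=2):
    counts = Counter(change_hours).most_common(2)
    if len(counts) < 2:
        return None
    (h1, c1), (h2, c2) = counts
    total = len(change_hours)
    gap = min(abs(h1 - h2), 24 - abs(h1 - h2))
    if c1 / total > min_share and c2 / total > min_share and gap >= min_gap:
        return sorted((h1, h2))
    return None
```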
4. Pull-Up Strategy Extractor
This extractor is a special case of the target position strategy extractor; the detection of a target position strategy is therefore a prerequisite. It detects whether a top three position is held and two major price gaps exist, each of which accounts for at least 40% of all gaps. The gap to the next lower competitor should be zero, because matching the better position is the first half of this special strategy. The higher gap should be within ten cents. The pull-up gap is added as an additional underlying strategy parameter.
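A sketch of the pull-up conditions, assuming gaps are given in cents and the target position prerequisite has already been checked; names and the gap representation are illustrative.

```python
from collections import Counter

# Hedged sketch of the pull-up check: a top-three position plus two
# dominant gap values to the next-cheaper competitor, one of zero (the
# matched position) and one of at most ten cents (the pull-up).

def looks_like_pull_up(avg_position, gaps_cents, min_share=0.40):
    if avg_position > 3:
        return None
    top_two = Counter(gaps_cents).most_common(2)
    if len(top_two) < 2:
        return None
    total = len(gaps_cents)
    shares_ok = all(count / total >= min_share for _, count in top_two)
    gap_values = sorted(g for g, _ in top_two)
    if shares_ok and gap_values[0] == 0 and 0 < gap_values[1] <= 10:
        return {"pull_up_gap_cents": gap_values[1]}
    return None
```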
5. Target Position Strategy Extractor
The target position strategy extractor expects the following conditions to be fulfilled:
• The average position is lower than ten, because high target positions are assumed to be unrealistic due to their missing customer impact.
• A main position is detected, meaning that the reseller stays on this position for more than 70% of the price series. Either:
– A main position with delivery costs is held.
– A main position without delivery costs is held.
The target position, the price gap, minimum and maximum boundaries, the main price change time in hours (UTC) and the delivery cost consideration are derived from the reseller features.
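The core of the conditions above can be sketched as follows; this simplified version ignores the with/without delivery cost variants and uses assumed names.

```python
from collections import Counter

# Sketch of the target position conditions: average position below ten
# and one main position held for more than 70% of the series.

def extract_target_position(positions, min_share=0.70, max_avg_position=10):
    avg = sum(positions) / len(positions)
    if avg >= max_avg_position:
        return None
    main_pos, count = Counter(positions).most_common(1)[0]
    share = count / len(positions)
    if share > min_share:
        return {"target_position": main_pos, "share": share}
    return None
```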
6. Interlink Strategy Extractor
This strategy extractor performs a Granger causality test. It uses R in combination with the lmtest library33. The underlying logic can be obtained from algorithm 2. This extractor retrieves the interlinked competitor, the interlink lag and the corresponding p-value.
A preprocessing step cleans the offers of unidentified resellers and inexplicable prices. If a reseller offers multiple product variants which are all assigned to a single GTIN, only the variant with the lowest price is considered. In addition, the offers are aggregated on an hourly basis.
33 https://cran.r-project.org/web/packages/lmtest/index.html
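The preprocessing step can be sketched as follows; the offer dict layout is an assumption, and the lowest-price variant is picked per reseller, GTIN and hour for simplicity.

```python
# Sketch of the preprocessing: per reseller, GTIN and hour keep only the
# cheapest offer (collapsing multiple variants of one GTIN), yielding an
# hourly price series per reseller and product.

def preprocess(offers):
    # offers: list of dicts with keys reseller, gtin, hour, price.
    cheapest = {}
    for o in offers:
        key = (o["reseller"], o["gtin"], o["hour"])
        if key not in cheapest or o["price"] < cheapest[key]["price"]:
            cheapest[key] = o
    return sorted(cheapest.values(), key=lambda o: o["hour"])
```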
Algorithm 2 Interlink strategy extractor scheme.
concentrate market data                                ▷ 120/360 min for daily/weekly motifs
convert reseller price series to delta series          ▷ direction delta
for all resellers do
    nearCompetitors ← filter surrounding competitors   ▷ ±2% avg reseller price
    for competitor in nearCompetitors do
        for all allowed lags do                        ▷ [1..23]
            congruentSeries ← calculate intersection: reseller ↔ competitor
            if congruentSeries.size > threshold then   ▷ two weeks
                causality ← calculate Granger causality
            end if
        end for
    end for
    maxCausality ← filter maximum causality
    if maxCausality.pValue ≤ threshold then            ▷ 0.01
        found interlink strategy
    end if
end for
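The Granger causality computation at the heart of algorithm 2 can be sketched in a few lines. The thesis uses the R lmtest package; the sketch below instead computes the Granger F statistic with plain least squares (restricted model: own lags only; unrestricted model: own lags plus the competitor's lags), which is the same test idea under stated assumptions. The data and names are illustrative.

```python
import numpy as np

# Minimal Granger-causality F statistic: does adding lagged values of x
# improve the prediction of y beyond y's own lags?

def granger_f(y, x, lag):
    rows = range(lag, len(y))
    Y = np.array([y[t] for t in rows])
    own = np.array([[1.0] + [y[t - k] for k in range(1, lag + 1)]
                    for t in rows])
    full = np.array([[1.0] + [y[t - k] for k in range(1, lag + 1)]
                     + [x[t - k] for k in range(1, lag + 1)] for t in rows])
    # Residual sums of squares of the restricted and unrestricted models.
    rss_r = np.sum((Y - own @ np.linalg.lstsq(own, Y, rcond=None)[0]) ** 2)
    rss_u = np.sum((Y - full @ np.linalg.lstsq(full, Y, rcond=None)[0]) ** 2)
    df = len(Y) - full.shape[1]
    return ((rss_r - rss_u) / lag) / (rss_u / df)

# Usage: y follows x with a one-step delay, so the F statistic is large;
# the extractor would then convert F to a p-value and keep interlinks
# with p <= 0.01.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.concatenate(([0.0], x[:-1])) + 0.01 * rng.normal(size=200)
print(granger_f(y, x, lag=1))
```

Looping this over near competitors and lags 1 to 23 and keeping the maximum causality mirrors the structure of algorithm 2.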
5.3.3 Evaluation
The preprocessing step yielded 6,632 analyzable reseller price series. The strategy extraction pipeline assigned the reseller price series to the following pricing strategies:
• 3,484 manual strategies (static and hit&run strategies)
• 2,641 unknown strategies
• 507 automated strategies (time frame, target position, pull-up and interlink strategies)
The distribution of extracted strategies is plotted in figure 28.
In chapter 5.1 an automated repricing classification dataset has been created by domain experts. Compared against this dataset, the heuristic strategy extraction achieved a recall of 44.90% and a precision of 33.93%. The ratio of detected automated repricing strategies is 7.64%, compared to 5.25% in the classification dataset (chapter 5.1.3).
Subsequently, samples of the resulting strategy buckets are analyzed manually.
Hit and Run Strategy
The 450 discovered hit and run strategies originate from 282 distinct resellers. This strategy
bucket is characterized by an average of 12.54 price segments with a maximum price segment
count of 683. The average offering time is 10.71 hours. Of the 450 strategies, 111 are positioned
in the top three.
Interlink Strategy
The interlink strategy extractor achieves promising results. Approximately every second found interlink strategy can be justified by manual inspection (sample basis of n=20). Figure 29
Figure 28: Extracted pricing strategies (static: 3,034; unknown: 2,641; hit and run: 450; interlink: 319; time frame: 148; target position: 35; pull up: 5).
shows two strongly interlinked reseller price series of a car tyre (GTIN 4019238454291). Reseller mein-reifen-outlet.de Granger-causes giga-reifen.de: giga-reifen.de has implemented an interlink strategy and responds within five hours by price alignments. Figure 30 shows two weakly interlinked reseller price series of a notebook (GTIN 888462348164). Reseller acom-pc.de Granger-causes future-x.de.
The interlink strategy extractor tends to detect co-dependencies; the initiator is unclear in such cases. Nearly fully congruent reseller time series have been tracked down. Such an interlink has been proven for the resellers electronis.de and mp3-player.de for a smartphone (GTIN 888462039147). A closer look (domain check) at those resellers reveals that they belong to the same company. The same is true for the resellers notebooksbilliger.de and nullprozentshop.de; nullprozentshop.de responds with slightly delayed price adjustments (within two hours). These two findings may explain the observed co-dependencies. The extractor also spots a reseller's own offers injected from other market places. Such co-dependencies could be further exploited in order to derive shop connections and draw a corresponding map.
However, due to the multi-reseller dependency, it is conceivable that one price change causes multiple resellers to react, triggering a cascade. The reverse engineering of such cascades, or even nested cascades, makes locating the original cause more complicated.
Time Frame Strategy
The detected time frame strategies can be divided into 125 simple and 23 motif time frame strategies. The motif discovery time frame strategy extractor achieved excellent results: all classified time frame strategies have been manually verified. Six weekly-based and seventeen daily-based motifs have been discovered.
Reseller reifensuche.com applies a night-based time frame strategy, which can be seen in figure 31 (GTIN 4019238454291). Between 1 am (UTC) and 5 am (UTC) the prices are regularly
Figure 29: Interlink between mein-reifen-outlet.de and giga-reifen.de.
Figure 30: Interlink between acom-pc.de and future-x.de.
adjusted to predefined price steps. In this case, the extractor identified 23 occurrences of the most frequent motif (between 11/7/2015 and 12/30/2015). Furthermore, this reseller applies the described strategy to all of its products from the car category.
Figure 31: Night time frame strategy (reifensuche.com).
Another assortment repricing strategy has been detected for reseller plus.de. This reseller adjusts its assortment at 0:00 (UTC) with predefined prices, which can be seen in figure 32.
Unfortunately, the second time frame extractor (heuristic) has proven error-prone: no correctly classified time frame strategy was found on a sample basis (n=20). This extractor is vulnerable to misclassification of time series containing price segments.
Target Position Strategy
The target position strategy extraction heuristic is underperforming. Out of 35 supposed target position strategies, only two could be manually verified as real target position strategies. Both correspond to the same reseller, parfumgroup.de. The derived strategies are shown in table 16. Reseller parfumgroup.de tries to underbid the competitor easycosmetic.de and triggers a downward pricing spiral. Remarkable are the reaction interval of only 30 minutes and the absence of price boundaries.
Pull-Up Strategy
There is evidence that three extracted price series are correctly assigned to a pull-up strategy. The underlying extractor assumes apolux.de to be using the pull-up strategy for three products: GTIN 6998784, 8628270 and 4058900010236. A price series of apolux.de can be found in figure 8(b) of chapter 4.3.3. The same high-frequency 'pull-ups' around position one can also be seen in the three assumed products.
Figure 32: Daily assortment repricing strategy (plus.de, GTIN 4006825538250 and GTIN 4211129851701).
Reseller        GTIN           Target    Delivery Costs  Price  Main Price Change  Minimum  Maximum
                               Position  Consideration   Gap    Time (UTC)         Price    Price
parfumgroup.de  3439602810118  1         true            0.11€  5 am               26.42€   29.70€
parfumgroup.de  3439602810019  1         true            0.11€  5 am               40.49€   50.94€

Table 16: Correctly identified target position strategies.
Figure 33: The target position strategy in action (parfumgroup.de vs. easycosmetic.de, GTIN 3439602810019).
5.3.4 Discussion
This chapter has shown that, depending on the extractor's quality, pricing strategy extraction is possible. The bottom line is threefold:
1. The manual strategy extractors (namely static and hit & run) are based on simple assumptions. These extractors can explain over 52% of all used strategies.
2. The profound automated repricing strategy extractors (namely interlink and time frame with motif discovery) can realize their potential.
3. The heuristic-based strategy extractors (namely target position and simple time frame) fail in their mission. The heuristics are not sufficient to cope with the dynamic environment on a CSA and its artificial price series. Moreover, it is difficult to justify the heuristics' thresholds and values, which may be overfitted to the dataset.
The presented bucket pre-selection approach has serious disadvantages, such as ignoring the type II error and building only on predefined strategies. However, it does not need training data and can be stacked with problem-specific extractors.
A practical implementation should focus on small steps: the derivation of strategy parameters like minimum and maximum price boundaries and crawling date distributions. This information is already meaningful for customers. In order to tackle the full strategy extraction, the author suggests a combined approach of the manual strategy extractors (static and hit & run) plus the classification engine from chapter 5.1.
6 Conclusion
This thesis was driven by unmasking and exploiting prices in e-commerce. Backed by a recent dataset of 21.6 million crawled offers, this thesis goes far beyond the existing literature on CSAs.
A market review of repricing providers revealed that a broad spectrum of sophisticated pricing strategies is already deployed. A competitive market analysis has given evidence of advanced pricing dynamics which differ on dimensions like time and product layer. The main goal, namely the derivation of pricing strategies and the forecasting of prices, has been approached by a consistent divide-and-conquer concept. The smallest step towards extracting pricing strategies is the ability to distinguish between handcrafted and artificial price series. This task has been addressed by supervised classification grounded on multiple decision trees and rich auto-tuning techniques. Even with a restricted daily crawling interval, accurate classification is still possible.
The linchpin of the price prediction is a two-stage decision tree predictor using random forests and M5 regression trees. The approach exploits assortment knowledge in order to predict single price series. Up to 11% fewer prediction errors are made in comparison to a reference stable price predictor, while a wide range of time series predictors is outperformed. These results reveal that the strategies themselves are not necessary to make good predictions. The presented approach benefits from the assortment; vice versa, more optimization potential is expected when larger assortments are considered. The strategy extraction concept builds on serial extraction with different strategy-dependent filters. The Granger causality-based interlink strategy extractor and the motif discovery time frame strategy extractor show promising results. However, the general approach has shortcomings, like ignoring the type II error and partially relying on heuristics. These separated heuristics could not fulfill their task of strategy extraction.
This thesis is limited by using only one CSA; cross-CSA interactions are not considered. Further, only a single-step-ahead prediction is performed. Complete market simulations could be the next prediction level. Changing and enlarging the selected products could reveal more interactions and assortment insights. The main machine learning approach concentrates on decision trees. However, other promising concepts should be explored, like artificial neural networks or vector autoregression.
Wherever possible, the thesis gives practice-oriented guidance for repricing providers.
Enhancing the prediction performance could be achieved by a smart recombination (stacking) of the prediction approaches. Since the presented decision tree predictor relies on two stages, it can be reused for these experiments. Additionally, executing multiple predictions and deciding by majority vote is conceivable.
In order to overcome the strategy extraction issue, an all-new concept should be considered: the gained insights from the market analysis could be used in combination with an implementation of the identified strategies in order to generate synthetic price series with underlying strategies. Afterwards, unsupervised learning algorithms are able to train a strategy-dependent
model. An evaluation should be conducted with a real dataset, which enables the estimation of the prediction accuracy and the assurance that overfitting is avoided. Furthermore, a reseller price series should not be tagged with a single assumed strategy; instead, a probability vector should be supplied over all strategies.
Another promising approach would be probing different repricing algorithms and prediction schemes with a real reseller. In such a case, the real strategy and prediction impacts could be measured instead of relying on theoretical models.
The reverse engineering of price intelligence is viable, and using that knowledge brings us back to the future of repricing.
Bibliography
Agrawal, R.; Ieong, S., and Velu, R. (2011a). Ameliorating Buyer’s Remorse. In: Proceedings ofthe 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD’11. San Diego, California, USA: ACM, pp. 351–359.
– (2011b). Timing when to Buy. In: Proceedings of the 20th ACM International Conference onInformation and Knowledge Management. CIKM ’11. Glasgow, Scotland, UK: ACM, pp. 709–718.
Ahmad, H. W.; Zilles, S.; Hamilton, H. J., and Dosselmann, R. (2016). Prediction of retailprices of products using local competitors. In: International Journal of Business Intelligence andData Mining 11 (1), pp. 19–30.
Aprimo (2012). Showrooming Uncovers a New World of Retail Opportunities. URL: http://www.teradata.de/Resources/White-Papers/Showrooming-Uncovers-a-New-World-of-Retail-Opportunities (visited on 09/26/2016).
Arias, M.; Arratia, A., and Xuriguera, R. (2013). Forecasting with twitter data. In: ACM Trans-actions on Intelligent Systems and Technology (TIST) 5 (1), 8:1–8:24.
Baird, N. and Rosenblum, P. (2015). Pricing 2015: Learning To Live In A Dynamic, Pro-motional World. RSR Retail Systems Research. URL: http : / / www . rsrresearch .com/research/pricing- 2015- learning- to- live- in- a- dynamic-promotional-world (visited on 09/26/2016).
– (2013). Tough Love: An In Depth Look at Retail Pricing Practices. RSR Retail Systems Re-search. URL: http://www.rsrresearch.com/research/tough-love-an-in-depth-look-at-retail-pricing-practices (visited on 09/26/2016).
Bakos, J. Y. (1997). Reducing Buyer Search Costs: Implications for Electronic Marketplaces. In:Management Science 43 (12), pp. 1676–1692.
Baye, M. R.; Gatti, J. R. J.; Kattuman, P., and Morgan, J. (2009). Clicks, discontinuities, andfirm demand online. In: Journal of Economics & Management Strategy 18 (4), pp. 935–975.
Baye, M. R. and Morgan, J. (2001). Information gatekeepers on the internet and the competi-tiveness of homogeneous product markets. In: The American Economic Review 91 (3), pp. 454–474.
Baye, M. R.; Morgan, J., and Scholten, P. (2004). Price dispersion in the small and in the large:Evidence from an internet price comparison site. In: The Journal of Industrial Economics 52 (4),pp. 463–496.
Bergen, M.; Ritson, M.; Dutta, S.; Levy, D., and Zbaracki, M. (2003). Shattering the myth ofcostless price changes. In: European Management Journal 21 (6), pp. 663–669.
Błazewicz, J.; Kovalyov, M.; Musiał, J.; Urbanski, A., and Wojciechowski, A. (2010). In-ternet Shopping Optimization Problem. In: International Journal of Applied Mathematics andComputer Science 20 (2), pp. 385–390.
Bodur, H. O.; Klein, N. M., and Arora, N. (2015). Online price search: Impact of price comparisonsites on offline price evaluations. In: Journal of Retailing 91 (1), pp. 125–139.
Boer, A. V. d. (2015a). Dynamic pricing and learning: Historical origins, current research, andnew directions. In: Surveys in Operations Research and Management Science 20 (1), pp. 1–18.
– (2014). Dynamic Pricing with Multiple Products and Partially Specified Demand Distribution.In: Mathematics of Operations Research 39 (3), pp. 863–888.
– (2015b). Tracking the market: Dynamic pricing and learning in a changing environment. In:European Journal of Operational Research 247 (3), pp. 914–927.
Bounie, D.; Eang, B.; Sirbu, M. A., and Waelbroeck, P. (2012). Online Price Dispersion: AnInternational Comparison. Tech. rep. Department of Economics and Social Sciences, TelecomParisTech, pp. 1–19.
Breiman, L. (2001). Random Forests. In: Machine Learning 45 (1), pp. 5–32.
6 Bibliography 78
Bretschneider, U.; Gierczak, M. M.; Sonnick, A., and Leimeister, J. M. (2015). Auf derJagd nach dem günstigsten Preis: Was beeinflusst die Kaufabsicht von Nutzern von Produkt-und Preisvergleichsseiten? In: Marktplätze im Umbruch. Springer, pp. 43–53.
Broeckelmann, P. and Groeppel-Klein, A. (2008). Usage of mobile price comparison sites at thepoint of sale and its influence on consumers’ shopping behaviour. In: The International Reviewof Retail, Distribution and Consumer Research 18 (2), pp. 149–166.
Brynjolfsson, E. and Smith, M. D. (2000). Frictionless Commerce? A Comparison of Internet andConventional Retailers. In: Management Science 46 (4), pp. 563–585.
– (2001). The great equalizer? Consumer choice behavior at Internet shopbots. Tech. rep. 4208-01.MIT Sloan Working Paper, pp. 1–63.
Chawla, N. V.; Bowyer, K. W.; Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic Mi-nority Over-sampling Technique. In: Journal of Artificial Intelligence Research 16 (1), pp. 321–357.
Chen, M. and Chen, Z.-L. (2014). Recent Developments in Dynamic Pricing Research: Multi-ple Products, Competition, and Limited Demand Information. In: Production and OperationsManagement 24 (5), pp. 704–731.
Clay, K.; Krishnan, R., and Wolff, E. (2001). Prices and price dispersion on the web: evidencefrom the online book industry. In: The Journal of Industrial Economics 49 (4), pp. 521–539.
Clement, R. and Schreiber, D. (2013). Internet-Ökonomie. Springer Berlin Heidelberg.Chap. Leistungsfähigkeit elektronischer Märkte, pp. 255–299.
Cohen, W. W. (1995). Fast Effective Rule Induction. In: Proceedings of the twelfth internationalconference on machine learning, pp. 115–123.
Currie, C. S. M.; Cheng, R. C. H., and Smith, H. K. (2007). Dynamic pricing of airline ticketswith competition. In: The Journal of the Operational Research Society 59 (8), pp. 1026–1037.
Dasgupta, P. and Das, R. (2000). Dynamic pricing with limited competitor information in amulti-agent economy. In: Proceedings of the International Conference on Cooperative Informa-tion Systems. Springer, pp. 299–310.
Dasgupta, P. and Melliar-Smith, P. M. (2003). Dynamic consumer profiling and tiered pricingusing software agents. In: Electronic Commerce Research 3 (3-4), pp. 277–296.
Deck, C. A. and Wilson, B. J. (2003). Automated pricing rules in electronic posted offer markets.In: Economic Inquiry 41 (2), pp. 208–223.
DiMicco, J. M.; Greenwald, A., and Maes, P. (2001). Dynamic Pricing Strategies Under a FiniteTime Horizon. In: Proceedings of the 3rd ACM Conference on Electronic Commerce. EC ’01.Tampa, Florida, USA: ACM, pp. 95–104.
Domínguez-Menchero, J. S.; Rivera, J., and Torres-Manzanera, E. (2014). Optimal purchasetiming in the airline market. In: Journal of Air Transport Management 40 (C), pp. 137–143.
Economist, T. (1999). Frictions in cyberspace. URL: http://www.economist.com/node/346410 (visited on 10/06/2016).
Eisen, M. (2011). Amazon’s $23,698,655.93 book about flies. URL: http : / / www .michaeleisen.org/blog/?p=358 (visited on 10/08/2016).
Ellison, G. and Ellison, S. F. (2009). Search, obfuscation, and price elasticities on the internet.In: Econometrica 77 (2), pp. 427–452.
Elmaghraby, W. and Keskinocak, P. (2003). Dynamic Pricing in the Presence of Inventory Consid-erations: Research Overview, Current Practices, and Future Directions. In: Management Science49 (10), pp. 1287–1309.
Etzioni, O.; Tuchinda, R.; Knoblock, C. A., and Yates, A. (2003). To Buy or Not to Buy: MiningAirfare Data to Minimize Ticket Purchase Price. In: Proceedings of the Ninth ACM SIGKDD In-ternational Conference on Knowledge Discovery and Data Mining. KDD ’03. Washington, D.C.:ACM, pp. 119–128.
Bibliography 79
Gönsch, J.; Klein, R.; Neugebauer, M., and Steinhardt, C. (2013). Dynamic pricing with strate-gic customers. In: Journal of Business Economics 83 (5), pp. 505–549.
Granger, C. W. J. (1969). Investigating Causal Relations by Econometric Models and Cross-spectralMethods. In: Econometrica 37 (3), pp. 424–438.
Grover, V.; Lim, J., and Ayyagari, R. (2006). The dark side of information and market efficiencyin e-markets. In: Decision Sciences 37 (3), pp. 297–324.
Groves, W. and Gini, M. (2015). On Optimizing Airline Ticket Purchase Timing. In: ACM Trans-actions on Intelligent Systems and Technology (TIST) 7 (1), 3:1–3:28.
Guyon, I. and Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. In: Journalof Machine Learning Research 3 (March), pp. 1157–1182.
Hackl, F.; Kummer, M. E.; Winter-Ebmer, R., and Zulehner, C. (2014). Market structure andmarket performance in E-commerce. In: European Economic Review 68 (1), pp. 199–218.
Haynes, M. and Thompson, S. (2008a). Entry and exit behavior at a shopbot: E-sellers as kirzne-rian entrepreneurs.
– (2008b). Price, price dispersion and number of sellers at a low entry cost shopbot. In: Interna-tional Journal of Industrial Organization 26 (2), pp. 459–472.
Hertweck, B. M.; Rakes, T. R., and Rees, L. P. (2009). The effects of comparison shoppingbehaviour on dynamic pricing strategy selection in an agent-enabled e-market. In: Internationaljournal of electronic business 7 (2), pp. 149–169.
– (2010). Using an intelligent agent to classify competitor behavior and develop an effective E-market counterstrategy. In: Expert Systems with Applications 37 (12), pp. 8841–8849.
Holland, M. (2014). Fehler bei Amazon: Hunderte Waren für einen Penny verkauft. URL: http://www.heise.de/newsticker/meldung/Fehler-bei-Amazon-Hunderte-Waren-fuer-einen-Penny-verkauft-2490907.html (visited on 10/08/2016).
Hsu, C.-W.; Chang, C.-C., and Lin, C.-J. (2003). A Practical Guide to Support Vector Classifica-tion. Tech. rep. National Taiwan University, Taipei 106, Taiwan.
Hyndman, R. J. and Athanasopoulos, G. (2014). Forecasting: Principles and Practice. Otexts.292 pp.
Hyndman, R. J. and Khandakar, Y. (2008). Automatic Time Series Forecasting: The forecastPackage for R. In: Journal of Statistical Software 26 (3), pp. 1–22.
Hyndman, R. J.; Koehler, A. B.; Snyder, R. D., and Grose, S. (2002). A state space frameworkfor automatic forecasting using exponential smoothing methods. In: International Journal ofForecasting 18 (3), pp. 439–454.
Jung, K.; Cho, Y. C., and Lee, S. (2014). Online shoppers’ response to price comparison sites. In:Journal of Business Research 67 (10), pp. 2079–2087.
Kachani, S. and Shmatov, K. (2010). Competitive Pricing in a Multi-Product Multi-AttributeEnvironment. In: Production and Operations Management 20 (5), pp. 668–680.
Kephart, J. O.; Hanson, J. E., and Greenwald, A. R. (2000). Dynamic pricing by softwareagents. In: Computer Networks 32 (6), pp. 731–752.
Klausegger, C. (2011). Geizhals Händlerbefragung 2011. URL: http://unternehmen.geizhals . de / about / files / presse / Geizhals _ Haendlerstudie _30112011.pdf (visited on 09/26/2016).
– (2009). Österreichische Konsumenten unter der Lupe. URL: http : / / unternehmen .geizhals.de/about/files/presse/Geizhals_Userbefragung.pdf (vis-ited on 09/26/2016).
Kocas, C. (2002). Evolution of Prices in Electronic Markets Under Diffusion of Price-ComparisonShopping. In: Journal of Management Information Systems 19 (3), pp. 99–119.
Kohavi, R. (1995). A Study of Cross-validation and Bootstrap for Accuracy Estimation and ModelSelection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence -
Bibliography 80
Volume 2. IJCAI’95. Montreal, Quebec, Canada: Morgan Kaufmann Publishers Inc., pp. 1137–1143.
Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. In: Artificial Intelligence97 (1-2), pp. 273–324.
Kopalle, P.; Biswas, D.; Chintagunta, P. K.; Fan, J.; Pauwels, K.; Ratchford, B. T., and Sills,J. A. (2009). Retailer Pricing and Competitive Effects. In: Journal of Retailing 85 (1), pp. 56–70.
Kutschinski, E.; Uthmann, T., and Polani, D. (2003). Learning competitive pricing strategies bymulti-agent reinforcement learning. In: Journal of Economic Dynamics and Control 27 (11-12),pp. 2207–2218.
Levin, Y.; McGill, J., and Nediak, M. (2009). Dynamic Pricing in the Presence of Strategic Con-sumers and Oligopolistic Competition. In: Management Science 55 (1), pp. 32–46.
Lin, J.; Keogh, E.; Lonardi, S., and Patel, P. (2002). Finding Motifs in Time Series. In: Proceed-ings of the 2nd Workshop on Temporal Data Mining, pp. 1–11.
Lin, K. Y. and Sibdari, S. Y. (2009). Dynamic price competition with discrete customer choices.In: European Journal of Operational Research 197 (3), pp. 969–980.
Livera, A. M. D.; Hyndman, R. J., and Snyder, R. D. (2011). Forecasting Time Series WithComplex Seasonal Patterns Using Exponential Smoothing. In: Journal of the American StatisticalAssociation 106 (496), pp. 1513–1527.
Lucchese, G.; Ketter, W.; Dalen, J. van, and Collins, J. (2012). Forecasting Prices in DynamicHeterogeneous Product Markets Using Multivariate Prediction Methods. In: Proceedings of the13th International Conference on Electronic Commerce. ICEC ’11. Liverpool, United Kingdom:ACM, 26:1–26:10.
Mei-Pochtler, A. and Hepp, M. (2013). Die neue Welt des Handels. In: Retail Business. Springer,pp. 77–98.
Meyer, S. (2012). Dynamische Preisoptimierung im E-Commerce. In: Information Managementund Consulting Sonderausgabe, pp. 68–75.
Minnen, D.; Isbell, C.; Essa, I., and Starner, T. (2007). Detecting Subdimensional Motifs: An Ef-ficient Algorithm for Generalized Multivariate Pattern Discovery. In: Seventh IEEE InternationalConference on Data Mining (ICDM 2007). Institute of Electrical and Electronics Engineers(IEEE), pp. 1–10.
Moe, W. W. (2003). Buying, Searching, or Browsing: Differentiating Between Online ShoppersUsing In-Store Navigational Clickstream. In: Journal of Consumer Psychology 13 (1-2), pp. 29–39.
Moraga-González, J. L. and Wildenbeest, M. R. (2011). Comparison sites. Tech. rep. 933. IESEBusiness School - University of Navarra, pp. 1–31.
Pathak, B. K. (2012). Comparison shopping agents and online price dispersion: A search cost basedexplanation. In: Journal of Theoretical and Applied Electronic Commerce Research 7 (1), pp. 64–76.
Petrescu, P.; Ghita, M., and Loiz, D. (2014). Google Organic CTR Study 2014. Advanced WebRanking. URL: https://www.advancedwebranking.com/google-ctr-study-2014.html (visited on 09/26/2016).
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Francisco, CA, USA: MorganKaufmann Publishers Inc.
Quinlan, R. J. (1992). Learning with Continuous Classes. In: 5th Australian Joint Conference onArtificial Intelligence. Singapore: World Scientific, pp. 343–348.
Ramezani, S.; Bosman, P. A., and Poutre, H. L. (2011). Adaptive Strategies for Dynamic PricingAgents. In: Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelli-gence and Intelligent Agent Technology. Vol. 2. Institute of Electrical and Electronics Engineers(IEEE), pp. 323–328.
Bibliography 81
Riekhof, H.-C. and Wurr, F. (2013). Steigerung der Wertschöpfung durch intelligentes Pricing:Eine empirische Untersuchung. Tech. rep. 2013/02. PFH Private Hochschule Göttingen.
Sapankevych, N. and Sankar, R. (2009). Time Series Prediction Using Support Vector Machines:A Survey. In: IEEE Computational Intelligence Magazine 4 (2), pp. 24–38.
Sato, K. and Sawaki, K. (2013). A continuous-time dynamic pricing model knowing the competi-tor’s pricing strategy. In: European Journal of Operational Research 229 (1), pp. 223–229.
Schieder, C. and Lorenz, K. (2012). Pricing-Intelligence-Studie 2012. Technische UniversitätChemnitz. URL: https://www.tu-chemnitz.de/wirtschaft/wi2/wp/wp-content/uploads/2012/04/Pricing-Studie-State-of-the-Art-im-E-Commerce_v1.5.pdf (visited on 09/26/2016).
Senin, P. and Malinchik, S. (2013). SAX-VSM: Interpretable Time Series Classification Using SAXand Vector Space Model. In: 2013 IEEE 13th International Conference on Data Mining. Instituteof Electrical and Electronics Engineers (IEEE), pp. 1175–1180.
Shibuya, T.; Harada, T., and Kuniyoshi, Y. (2009). Causality quantification and its applications.In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery anddata mining - KDD ’09. Association for Computing Machinery (ACM), pp. 787–796.
Skorupa, J. (2014). Pricing Intelligence Goes to War. RIS. URL: http://risnews.edgl.com/retail-research/Pricing-Intelligence-Goes-to-War90346 (visited on 09/26/2016).
Smola, A. J. and Schölkopf, B. (2004). A tutorial on support vector regression. In: Statistics andComputing 14 (3), pp. 199–222.
Steiner, I. (2012). Appeagle Repricing Glitch Causes Penny Listings on Amazon. EcommerceBytes. URL: http://www.ecommercebytes.com/cab/abn/y12/m07/i18/s02 (visited on 10/08/2016).
Tanaka, Y.; Iwamoto, K., and Uehara, K. (2005). Discovery of Time-Series Motif from Multi-Dimensional Data Based on MDL Principle. In: Machine Learning 58 (2-3), pp. 269–300.
Taylor, J. W. (2003). Short-term electricity demand forecasting using double seasonal exponential smoothing. In: Journal of the Operational Research Society 54 (8), pp. 799–805.
Tirole, J. (1988). The Theory of Industrial Organization. MIT Press Ltd, pp. 209–212. 479 pp.

Transchel, S. and Minner, S. (2009). The impact of dynamic pricing on the economic order decision. In: European Journal of Operational Research 198 (3), pp. 773–789.

Varian, H. R. (1980). A Model of Sales. In: The American Economic Review 70 (4), pp. 651–659.

Waldfogel, J. and Chen, L. (2006). Does information undermine brand? Information intermediary use and preference for branded web retailers. In: The Journal of Industrial Economics 54 (4), pp. 425–449.
Wan, Y.; Menon, S., and Ramaprasad, A. (2003). A Classification of Product Comparison Agents. In: Proceedings of the 5th International Conference on Electronic Commerce. ICEC ’03. Pittsburgh, Pennsylvania, USA: ACM, pp. 498–504.
Wang, Y. and Witten, I. H. (1997). Induction of model trees for predicting continuous classes. In:Poster papers of the 9th European Conference on Machine Learning. Springer.
Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. In: Machine Learning 8 (3-4), pp. 279–292.
Weisstein, F. L.; Monroe, K. B., and Kukar-Kinney, M. (2013). Effects of price framing on consumers’ perceptions of online dynamic pricing practices. In: Journal of the Academy of Marketing Science 41 (5), pp. 501–514.
Welch, G. and Bishop, G. (2006). An Introduction to the Kalman Filter. Tech. rep. TR 95-041.University of North Carolina at Chapel Hill, pp. 1–16.
Witten, I. H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. Morgan Kaufmann.
Zhang, J. and Jing, B. (2011). The Impacts of Shopbots on Online Consumer Search. In: Proceedings of the 44th Hawaii International Conference on System Sciences (HICSS), pp. 1–10.
A Product Selection Process
This section shows the intermediate steps of the product selection. Table 17 presents a snapshot of the CSA Billiger.de with its top 40 most popular products on 10/25/2015. The corresponding categories act as a baseline and are scaled up by a factor of 2.5 in table 18. This step is performed with the purpose of obtaining a representative distribution for 100 products. Subsequently, the categories are simplified in table 19 and the number of quintuples is derived.
Position Product Category Price
1 Thierry Mugler Alien Eau de Pa... Gesundheit & Kosmetik – Kosmetik – Parfum 35.94 €
2 Apple iPhone 6 Handys & Telefon – Handys – Handys ohne Vertrag 545.00 €
3 Samsung Galaxy S5 Handys & Telefon – Handys – Handys ohne Vertrag 320.00 €
4 Sony PS4 500GB Unterhaltungselektronik – Konsolen & Zubehör – Konsolen 317.99 €
5 Samsung Galaxy S6 Handys & Telefon – Handys – Handys ohne Vertrag 446.20 €
6 Maxi-Cosi Pebble Baby & Kind – Unterwegs – Autokindersitze 148.54 €
7 Apple iPhone 6 Plus Handys & Telefon – Handys – Handys ohne Vertrag 684.99 €
8 Logitech UE Boom Unterhaltungselektronik – Audio & HiFi – HiFi 129.00 €
9 KitchenAid Artisan Küchenmasch... Haushalt – Küchengeräte – Küchenmaschine 394.68 €
10 Apple iPad mini 3 Computer & Software – Tablet PCs & Zubehör –Tablet PCs 312.79 €
11 Huawei P8 Lite Handys & Telefon – Handys – Handys ohne Vertrag 207.00 €
12 Sony Xperia Z3+ Handys & Telefon – Handys – Handys ohne Vertrag 495.88 €
13 Bosch GSR 10,8-2-LI Professional Heimwerken & Garten – Werkstatt & Werkzeug – Elektrowerkzeug 49.00 €
14 HTC One M9 Handys & Telefon – Handys – Handys ohne Vertrag 459.08 €
15 ABC-Design Turbo 4S Baby & Kind – Unterwegs – Kinderwagen 292.81 €
16 Sony Alpha 6000 Fotografie – Fotografie & Camcorder – Digitalkameras 468.20 €
17 Samsung Galaxy Alpha Handys & Telefon – Handys – Handys ohne Vertrag 349.00 €
18 Valentino Valentina Eau de Par... Gesundheit & Kosmetik – Kosmetik – Parfum 30.99 €
19 ABC-Design 3Tec Baby & Kind – Unterwegs – Kinderwagen 292.81 €
20 Microsoft Lumia 640 Handys & Telefon – Handys – Handys ohne Vertrag 119.84 €
21 Samsung Galaxy A3 Handys & Telefon – Handys – Handys ohne Vertrag 178.48 €
22 Bosch PSR 18 LI-2 Heimwerken & Garten – Werkstatt & Werkzeug – Elektrowerkzeug 78.83 €
23 Samsung Galaxy S5 mini Handys & Telefon – Handys – Handys ohne Vertrag 237.56 €
24 Voltaren Schmerzgel Gesundheit & Kosmetik – Arzneimittel 3.94 €
25 Philips SensoTouch 3D Gesundheit & Kosmetik – Kosmetik – Elektrischer Rasierer 141.31 €
26 Hankook Ventus Prime 2 K115 Auto & Motorrad – Reifen – PKW-Reifen 38.95 €
27 Samsung Galaxy S III mini Handys & Telefon – Handys – Handys ohne Vertrag 111.00 €
28 FC Bayern München Bettwäsche Spiel, Sport & Freizeit – Fanartikel 35.99 €
29 Makita BHP453 Heimwerken & Garten – Werkstatt & Werkzeug – Elektrowerkzeug 64.90 €
30 Goodyear Vector 4Seasons Auto & Motorrad – Reifen – PKW-Reifen 37.10 €
31 Loceryl Nagellack Gesundheit & Kosmetik – Arzneimittel 15.05 €
32 McNeill Ergo Light Compact Baby & Kind – Schulbedarf – Schulranzen Sets 109.95 €
33 Orthomol-Immun Gesundheit & Kosmetik – Arzneimittel 11.46 €
34 CYBEX Pallas Baby & Kind – Unterwegs – Autokindersitze 137.93 €
35 Maxi-Cosi CabrioFix Baby & Kind – Unterwegs – Autokindersitze 115.66 €
36 Bosch PSR 14,4 LI-2 Heimwerken & Garten – Werkstatt & Werkzeug – Elektrowerkzeug 114.99 €
37 Apple MacBook Pro Computer & Software – Notebooks 999.00 €
38 Maxi-Cosi Tobi Baby & Kind – Unterwegs – Autokindersitze 159.31 €
39 ABC-Design Turbo 6S Baby & Kind – Unterwegs – Kinderwagen 249.99 €
40 Quinny Zapp Xtra Baby & Kind – Unterwegs – Kinderwagen 150.64 €
Table 17: Top 40 (10/25/2015, 13:00) of Billiger.de.
A Product Selection Process 84
Categories Entries Scaled Entries
Handys & Telefon – Handys – Handys ohne Vertrag 12 30
Baby & Kind – Unterwegs – Autokindersitze 4 10
Baby & Kind – Unterwegs – Kinderwagen 4 10
Heimwerken & Garten – Werkstatt & Werkzeug – Elektrowerkzeug 4 10
Gesundheit & Kosmetik – Arzneimittel 3 7.5
Auto & Motorrad – Reifen – PKW-Reifen 2 5
Gesundheit & Kosmetik – Kosmetik – Parfum 2 5
Baby & Kind – Schulbedarf – Schulranzen Sets 1 2.5
Computer & Software – Notebooks 1 2.5
Computer & Software – Tablet PCs & Zubehör – Tablet PCs 1 2.5
Fotografie – Fotografie & Camcorder – Digitalkameras 1 2.5
Gesundheit & Kosmetik – Kosmetik – Elektrischer Rasierer 1 2.5
Haushalt – Küchengeräte – Küchenmaschine 1 2.5
Spiel, Sport & Freizeit – Fanartikel 1 2.5
Unterhaltungselektronik – Audio & HiFi – HiFi 1 2.5
Unterhaltungselektronik – Konsolen & Zubehör – Konsolen 1 2.5
40 100
Table 18: Top 40 mapped categories of Billiger.de.
Mapped Categories Products Quintuples
Smartphone 30 6
Baby & Kind 20 4
Gesundheit & Kosmetik 15 3
Computer & Software 5 1
Heimwerken & Garten 10 2
Unterhaltungselektronik 5 1
Auto & Motorrad 5 1
Fotografie 5 1
Haushalt 5 1
100 20
Table 19: Product category selection.
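The scaling step behind tables 18 and 19 can be sketched as follows. The category names and entry counts are taken from the tables above, but the `SCALE` constant and the dictionary layout are illustrative assumptions rather than code from the thesis (table 19 additionally redistributes the fractional Schulranzen entries, which this sketch omits).

```python
# Illustrative reconstruction of the scaling in tables 18 and 19:
# top-40 category counts are scaled by 100/40 = 2.5 to target 100
# products, which are then grouped into quintuples of five products.
SCALE = 100 / 40  # = 2.5

top40_counts = {  # simplified categories with their top-40 entries
    "Smartphone": 12,
    "Baby & Kind": 4 + 4,                # Autokindersitze + Kinderwagen
    "Gesundheit & Kosmetik": 3 + 2 + 1,  # Arzneimittel + Parfum + Rasierer
    "Heimwerken & Garten": 4,
}

products = {cat: round(n * SCALE) for cat, n in top40_counts.items()}
quintuples = {cat: p // 5 for cat, p in products.items()}

print(products["Smartphone"], quintuples["Smartphone"])  # → 30 6
```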
B Classification Feature Selection Algorithms
Two different feature selection mechanisms have been developed. The first is presented in algorithm 3. The main idea is to sort the features by a calculated metric and select the feature with the highest metric gain. In every iteration the expected gains are recalculated and features are added until no further gain is possible or all features are selected. The selection scheme is wrapped in an n-fold cross validation.
Algorithm 3 Greedy feature selection algorithm.
procedure SELECT(selectedFeatures, remainingFeatures, currentBestMeasure)
    if remainingFeatures.isEmpty then
        return selectedFeatures
    else
        for n-fold Cross Validation do
            for all remainingFeatures do
                classifier.calculateGainsByGridSearch(selectedFeatures + remainingFeature)
            end for
        end for
        bestRemainingFeature = selectRemainingFeature(byHighestAvgGain)
        if bestRemainingFeature.maxGain > 0 then
            return SELECT(selectedFeatures.add(bestRemainingFeature),
                          remainingFeatures.remove(bestRemainingFeature),
                          currentBestMeasure + bestRemainingFeature.maxGain)
        else
            return selectedFeatures
        end if
    end if
end procedure
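As a minimal sketch of the greedy scheme in algorithm 3, forward selection can be expressed as below. The `cv_score` function is a hypothetical stand-in for the n-fold cross-validated grid-search measure, and the feature names and their utilities are invented for illustration.

```python
def cv_score(features):
    """Hypothetical stand-in for the cross-validated grid-search
    measure; a real implementation would train the classifier here."""
    utility = {"offerRatio": 0.6, "avgDelta": 0.25, "noise": -0.05}
    return sum(utility.get(f, 0.0) for f in features)

def greedy_select(remaining):
    """Forward selection: repeatedly add the feature with the highest
    expected gain until no feature improves the measure."""
    selected, best = [], 0.0
    while remaining:
        gains = {f: cv_score(selected + [f]) - best for f in remaining}
        feature, gain = max(gains.items(), key=lambda kv: kv[1])
        if gain <= 0:
            break  # no remaining feature yields a positive gain
        selected.append(feature)
        remaining.remove(feature)
        best += gain
    return selected

print(greedy_select(["noise", "offerRatio", "avgDelta"]))  # → ['offerRatio', 'avgDelta']
```

Like algorithm 3, this stops as soon as the best remaining feature yields no positive gain, which is exactly the point where a local maximum can trap the search.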
The greedy feature selection is fast but has serious shortcomings, such as the possibility of being trapped in a local maximum. Therefore, a second feature selection algorithm has been developed which relies on binary random sampling. This algorithm is shown in algorithm 4. The main idea is to pick a feature and create two classes of random samples: one class contains the current feature and the other does not. During sampling it becomes apparent whether the feature should be added. The whole algorithm is again wrapped in an n-fold cross validation.
Note: The applied metrics are always averaged over the number of folds.
B Classification Feature Selection Algorithms 86
Algorithm 4 Binary feature selection algorithm.
procedure SELECT(selectedFeatures, remainingFeatures, currentBestMeasure)
    if remainingFeatures.isEmpty then
        return selectedFeatures
    else
        potentialFeature = remainingFeatures.head
        for n-fold Cross Validation do
            for samples per n do
                r = selectRandom(remainingFeatures.remove(potentialFeature))
                classA = classifier.calculateGainByGridSearch(selectedFeatures + potentialFeature + r)
                classB = classifier.calculateGainByGridSearch(selectedFeatures + r)
            end for
        end for
        if avgGain(classA) > avgGain(classB) && avgGain(classA) > 0 then
            return SELECT(selectedFeatures + potentialFeature,
                          remainingFeatures.drop(1),
                          currentBestMeasure + avgGain(classA))
        else
            return SELECT(selectedFeatures, remainingFeatures.drop(1), currentBestMeasure)
        end if
    end if
end procedure
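The binary sampling idea of algorithm 4 can be sketched compactly as below. Again, `cv_gain` is a hypothetical stand-in for the cross-validated grid-search gain, and the feature utilities are invented: for each candidate, paired random samples with and without the candidate are compared by their average gain.

```python
import random

def cv_gain(features, utility):
    """Hypothetical stand-in for the cross-validated grid-search gain."""
    return sum(utility.get(f, 0.0) for f in features)

def binary_select(remaining, utility, samples=20, seed=0):
    """Per candidate feature, draw paired random feature subsets and
    keep the candidate only if its inclusion raises the average gain."""
    rng = random.Random(seed)
    selected = []
    for candidate in list(remaining):
        others = [f for f in remaining if f != candidate and f not in selected]
        class_a, class_b = [], []
        for _ in range(samples):
            r = rng.sample(others, k=min(2, len(others)))
            class_a.append(cv_gain(selected + [candidate] + r, utility))
            class_b.append(cv_gain(selected + r, utility))
        avg_a = sum(class_a) / samples
        if avg_a > sum(class_b) / samples and avg_a > 0:
            selected.append(candidate)
    return selected

utility = {"offerRatio": 0.6, "avgDelta": 0.25, "noise": -0.05}
print(binary_select(list(utility), utility))  # → ['offerRatio', 'avgDelta']
```

Unlike the greedy scheme, every feature gets a vote even after an unproductive iteration, which reduces the risk of stopping in a local maximum.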
C Classification Classifiers Grid Search Configuration
This section shows the classifier grid search parameters for the task of classifying price series into manual and automated repricing. The grid search is implemented as a brute-force approach resulting in 140 configurations for random forest, 42 configurations for C4.5 and 9 configurations for REP tree. The list of grid search parameters is shown in table 20.
Classifier Parameter Values
Random Forest Number of Trees 100, 200
Tree Depth default, 10
Minimum Instances 1, 2, 5, 10, 20, 50, 80
Number of Attributes default, 2, 5, 10, 15
C4.5 Confidence Intervals 0.01, 0.05, 0.10, 0.15, 0.25, 0.50
Minimum Instances 1, 2, 5, 10, 20, 50, 80
REP tree Minimum Instances 2, 5, 10, 15, 20, 30, 40, 50, 80
Table 20: Classification grid search parameters.
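The configuration counts stated above follow directly from the Cartesian product of the parameter lists in table 20; a quick sketch to verify them:

```python
from itertools import product

grids = {  # parameter lists copied from table 20
    "Random Forest": [
        [100, 200],                       # number of trees
        ["default", 10],                  # tree depth
        [1, 2, 5, 10, 20, 50, 80],        # minimum instances
        ["default", 2, 5, 10, 15],        # number of attributes
    ],
    "C4.5": [
        [0.01, 0.05, 0.10, 0.15, 0.25, 0.50],  # confidence intervals
        [1, 2, 5, 10, 20, 50, 80],             # minimum instances
    ],
    "REP tree": [
        [2, 5, 10, 15, 20, 30, 40, 50, 80],    # minimum instances
    ],
}

configs = {name: len(list(product(*params))) for name, params in grids.items()}
print(configs)  # → {'Random Forest': 140, 'C4.5': 42, 'REP tree': 9}
```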
C Classification Classifiers Grid Search Configuration 88
D Evaluation of Different Balancing Schemes
The automated repricing dataset is imbalanced: the manual repricing (MR) class accounts for almost 95%. Decision tree approaches are often misled in such cases since the optimization focuses on the majority class. Figure 34 shows different balancing schemes and their metric impacts for REP trees. The averaged results are based on a 10-fold cross validation with pure and injected classification schemes for both feature selectors: binary and greedy. If no balancing countermeasure is applied, the classifier performs poorly for the automated repricing (AR) class. Weight-based balancing massively improves the F-measure of the automated repricing class at the expense of overall accuracy. SMOTE achieves both overall accuracy and class-dependent accuracy. Similar balancing impacts are observed for C4.5 and random forest.
[Figure: achieved metric (F-measure(AR), F-measure(MR), ROC area; y-axis from 0.5 to 1.0) per balancing scheme: None, Weight-based, SMOTE.]
Figure 34: Different balancing schemes with REP trees.
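The weight-based scheme in figure 34 can be sketched as instance weights inversely proportional to class frequency. This is the common convention; the thesis does not spell out the exact formula, so the sketch below is an assumption.

```python
from collections import Counter

def balanced_weights(labels):
    """Weight each class inversely to its frequency so that both
    classes contribute equally to the training objective."""
    counts = Counter(labels)
    total, k = len(labels), len(counts)
    return {cls: total / (k * n) for cls, n in counts.items()}

# ~95% manual repricing (MR) vs. ~5% automated repricing (AR)
weights = balanced_weights(["MR"] * 95 + ["AR"] * 5)
print(weights)  # → {'MR': 0.526..., 'AR': 10.0}
```

SMOTE instead synthesizes new minority-class instances by interpolating between nearest neighbours, which is why it can raise minority-class recall without sacrificing overall accuracy.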
D Evaluation of Different Balancing Schemes 89
E Detailed Classification Results
This section covers the detailed evaluation results of the automated repricing classification. Table 21 shows the general class prediction results. The class-dependent results can be found in table 22. Table 23 shows the features which are favored by the classifiers. 'Preferred' means that the respective feature is selected in at least 60% of the cross validation folds.
Classifier C4.5 Random Forest REP Tree
Auto Repricer Policy pure injected pure injected pure injected
Feature Selector binary greedy binary greedy binary greedy binary greedy binary greedy binary greedy
Avg Number of Features 12 4 13 6 14 5 14 6 13 3 13 4
Calculated Trees 38210 12695 38420 16145 36330 12695 35930 14060 35740 8890 35750 11225
Avg Tree Size 68 89 145 123 N/A N/A N/A N/A 63 76 101 106
Avg Training ROC Area 97.28% 97.39% 90.45% 90.20% 98.59% 98.35% 93.08% 92.93% 97.58% 96.53% 90.03% 88.42%
Avg Prediction ROC Area 95.94% 95.22% 88.56% 88.04% 97.11% 96.99% 91.48% 90.46% 95.05% 94.70% 87.86% 86.53%
Table 21: Base results of the automated repricing classification.
Classifier C4.5 Random Forest REP Tree
Auto Repricer Policy pure injected pure injected pure injected
Feature Selector binary greedy binary greedy binary greedy binary greedy binary greedy binary greedy
Avg AR Precision 92.56% 91.83% 86.54% 86.92% 94.43% 93.88% 90.31% 88.50% 93.01% 88.41% 86.34% 87.22%
Avg AR Recall 86.00% 83.66% 72.22% 71.41% 81.58% 84.35% 70.48% 73.13% 82.25% 87.34% 71.78% 65.87%
Avg AR F-measure 88.88% 87.24% 78.44% 78.24% 87.33% 88.62% 78.96% 79.79% 86.78% 87.61% 78.03% 74.07%
Avg AR ROC Area 95.94% 95.22% 88.56% 88.04% 97.11% 96.99% 91.48% 90.46% 95.05% 94.70% 87.86% 86.53%
Avg MR Precision 87.09% 85.56% 77.14% 76.31% 84.36% 86.21% 76.53% 78.09% 85.04% 87.71% 76.77% 73.77%
Avg MR Recall 93.99% 93.45% 89.25% 89.50% 95.79% 95.21% 92.69% 90.73% 94.57% 89.59% 89.05% 90.78%
Avg MR F-measure 90.20% 89.14% 82.59% 82.29% 89.49% 90.21% 83.76% 83.84% 89.08% 88.46% 82.27% 81.13%
Avg MR ROC Area 95.94% 95.22% 88.56% 88.04% 97.11% 96.99% 91.48% 90.46% 95.05% 94.70% 87.86% 86.53%
Avg Total Precision 90.61% 89.42% 82.12% 81.89% 90.05% 90.78% 83.59% 83.44% 89.86% 88.93% 81.85% 80.81%
Avg Total Recall 89.99% 88.72% 81.08% 80.81% 88.98% 89.91% 81.99% 82.39% 88.54% 88.52% 80.70% 78.61%
Avg Total F-measure 89.97% 88.68% 80.85% 80.61% 88.94% 89.89% 81.68% 82.14% 88.47% 88.52% 80.46% 77.95%
Avg Total ROC Area 95.94% 95.22% 88.56% 88.04% 97.11% 96.99% 91.48% 90.46% 95.05% 94.70% 87.86% 86.53%
Table 22: Detailed results of the automated repricing classification.
E Detailed Classification Results 90
Auto Repricer Policy pure injected
Feature Selector binary greedy binary greedy
C4.5 Top Features
AvgDelta avgTop3ShortestChangeRatio avgTop3ShortestChangeRatio availability
avgDeltaToMinPriceProduct maxDeltaDayRatio offerRatio avgTop3ShortestChangeRatio
deltaDownRatio degreeInTop3 downUpDeltaRatio
avgTop3ShortestChangeRatio maxDeltaDayRatio offerRatio
offerRatio
maxDeltaDayRatio
Random Forest Top Features
avgDelta avgDelta avgDelta availability
distinctPriceRatio offerRatio availability distinctPriceRatio
deltaDownRatio maxDeltaDayRatio distinctPriceRatio deltaDownRatio
avgTop3ShortestChangeRatio avgPriceToProduct avgTop3ShortestChangeRatio
downUpDeltaRatio deltaUpRatio offerRatio
deltaUpRatio offerRatio mostFrequentCentEnding
offerRatio numberOfResellers
priceSegments longestPlateau
endogenousChangeRatio degreeInTop3
relativeMedianSpan mostFrequentCentEnding
avgDeltaToProduct
REP Tree Top Features
avgDelta N/A avgRelativeLowerGap deltaDownRatio
distinctPriceRatio offerRatio mainDeltaTime
downUpDeltaRatio avgGapToMinPrice
offerRatio
Table 23: Preferred features of the automated repricing classifiers.
F Large Decision Tree Examples
Figure 35 shows a resulting C4.5 decision tree with greedy selection and pure classification scheme. The leaves contain the number of classified and misclassified instances.
[Figure: a medium-sized C4.5 tree splitting repeatedly on maxDeltaDayRatio and avgTop3ShortestChangeRatio; leaves are labeled manual or auto with (classified/misclassified) instance counts.]
Figure 35: A generated C4.5 tree of medium size.
Figure 36 shows a resulting C4.5 decision tree with binary selection and pure classification scheme.
[Figure: a large C4.5 tree splitting on features such as maxDeltaDayRatio, avgDeltaToMinPriceProduct, offerRatio, availability, nightDeltaRatio and numberOfResellers; leaves are labeled manual or auto with (classified/misclassified) instance counts.]
Figure 36: A generated C4.5 tree of large size.
F Large Decision Tree Examples 92
G Prediction Classifier Grid Search Configuration
This section shows the classifier grid search parameters for the task of predicting prices. A decision/regression tree approach has been developed which uses a random forest and an M5 tree classifier. The grid search is implemented as a brute-force approach resulting in 24 configurations for the first prediction stage (price delta direction) and 144 configurations for the second stage (price delta amplitude prediction). The list of grid search parameters is shown in table 24.
Classifier Parameter Values
Random Forest Number of Trees 100
Tree Depth 10
Minimum Instances 2, 5, 10, 20, 50, 80
Number of Attributes default, 5, 10, 20
M5 Tree Minimum Instances 2, 5, 10, 20, 50, 80
Table 24: Price prediction grid search parameters.
G Prediction Classifier Grid Search Configuration 93
H Start Hour Prediction Comparison
This section analyzes the impact of the start hour on the prediction results. The effect is demonstrated by an absolute prediction based on a daily crawling interval with synthetic minimum prices. A 20-fold time series cross validation has been conducted. On the one hand, a decision tree predictor configured with no auto balancing and activated grid search is applied. On the other hand, a no delta predictor is applied. Figure 37 shows the RMSE stability depending on the start hour of the crawling interval (UTC). The RMSE stability is defined as:
RMSE stability = RMSE(no delta predictor) / RMSE(decision tree predictor)
This metric represents the extent to which the decision tree predictor outperforms the no delta predictor.
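A minimal sketch of the metric; the predictor outputs below are invented example values, not results from the thesis.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over aligned price series."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def rmse_stability(actual, baseline, tree):
    """Values > 1 mean the decision tree predictor beats the no delta
    predictor, which simply repeats the last observed price."""
    return rmse(actual, baseline) / rmse(actual, tree)

actual   = [10.0, 10.5, 10.2, 11.0]  # observed minimum prices
baseline = [10.0, 10.0, 10.5, 10.2]  # no delta: previous day's price
tree     = [10.1, 10.4, 10.3, 10.8]  # hypothetical tree predictions
print(round(rmse_stability(actual, baseline, tree), 2))  # → 3.74
```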
[Figure: RMSE stability (y-axis from 0.9 to 1.1) over the crawling start hour 0–23 (UTC).]
Figure 37: Start hours and RMSE stability.
The results show that high RMSE stability is reached with a standard deviation of 1.5629%. The standard deviation for the simple delta is even better at 0.9459%. These results illustrate that a fixed start hour can be chosen in order to reduce computational complexity while keeping the different prediction approaches comparable.
H Start Hour Prediction Comparison 94
I Detailed Minimum Price Prediction Results
This section provides detailed information about the minimum price prediction results. The underlying crawling interval is on a daily basis. The results are clustered by the corresponding delta type: simple delta (table 25), direction delta (table 26) and absolute delta (table 27). The decision tree approach's configuration refers to the used balancing scheme.
Predictor Configuration MAE RMSE PPR Grid Search Config
No Delta Predictor default 0.2141 0.4577 100.00%
Simple Predictor default 0.2544 0.5002 93.11%
Pretrained Decision Tree Predictor none 0.1911 0.4333 92.75% rfInstances=50;rfAttributes=5
Decision Tree Predictor none 0.2100 0.4555 90.60% rfInstances=50;rfAttributes=20
SMOTE 0.2151 0.4610 86.05% rfInstances=20;rfAttributes=5
Weight-based 0.2268 0.4728 82.05% rfInstances=2;rfAttributes=20
R Predictor BATS 0.6598 0.8114 13.47%
HW 0.2524 0.4988 76.28%
TBATS 0.6598 0.8114 13.47%
DSHW 0.2524 0.4988 76.28%
ARIMA 0.6348 0.7959 16.57%
NNETAR 0.6863 0.8276 10.71%
ETS 0.6301 0.7930 17.76%
STL 0.2524 0.4988 76.28%
Weka Predictor MLP 0.6051 0.7772 20.71%
LR 0.6900 0.8300 10.60%
SVR 0.4341 0.6582 43.43%
Weka Overlay Predictor MLP 0.5878 0.7660 22.65%
LR 0.6854 0.8273 11.06%
SVR 0.6041 0.7762 19.80%
Table 25: Minimum price prediction results for daily simple price deltas.
I Detailed Minimum Price Prediction Results 95
Predictor Configuration MAE RMSE PPR Grid Search Config
No Delta Predictor default 0.2141 0.4577 100.00%
Simple Predictor default 0.2800 0.5465 90.46%
Pretrained Decision Tree Predictor none 0.2013 0.4460 97.40% rfInstances=20;rfAttributes=5
Decision Tree Predictor none 0.2110 0.4698 94.10% rfInstances=20;rfAttributes=5
SMOTE 0.2197 0.4879 91.55% rfInstances=20;rfAttributes=default
Weight-based 0.2708 0.5601 82.55% rfInstances=2;rfAttributes=default
R Predictor BATS 0.8871 1.0298 11.37%
HW 0.3464 0.6591 76.28%
TBATS 0.8871 1.0298 11.37%
DSHW 0.3464 0.6591 76.28%
ARIMA 0.4410 0.7075 66.82%
NNETAR 0.7880 0.9771 20.98%
ETS 0.8019 0.9820 22.56%
STL 0.3464 0.6591 76.28%
Weka Predictor MLP 0.8943 1.0431 9.28%
LR 0.8472 1.0094 14.08%
SVR 0.6098 0.8624 40.87%
Weka Overlay Predictor MLP 0.8760 1.0333 11.22%
LR 0.8580 1.0100 11.27%
SVR 0.8035 0.9898 17.64%
Table 26: Minimum price prediction results for daily direction price deltas.
Predictor Configuration MAE RMSE PPR Grid Search Config
No Delta Predictor default 1.7322 6.8497 100.00%
Simple Predictor default 2.4473 7.8105 89.80%
Pretrained Decision Tree Predictor none 1.7731 6.6221 95.75% rfInstances=20;rfAttributes=20;m5Instances=10
Decision Tree Predictor none 1.8934 7.0232 93.30% rfInstances=20;rfAttributes=10;m5Instances=2
SMOTE 2.0825 7.6431 90.35% rfInstances=10;rfAttributes=10;m5Instances=2
Weight-based 2.9019 8.9427 80.25% rfInstances=5;rfAttributes=10;m5Instances=10
R Predictor BATS 2.3854 7.3318 11.88%
HW 3.3660 10.9172 76.28%
TBATS 2.3854 7.3318 11.88%
DSHW 3.3660 10.9172 76.28%
ARIMA 2.1149 6.9536 68.72%
NNETAR 3.0202 9.1919 15.91%
ETS 1.9060 6.9117 23.22%
STL 3.3660 10.9172 76.28%
Weka Predictor MLP 9.8493 23.624 9.74%
LR 2.5876 9.1841 21.74%
SVR 3.4907 11.0387 18.92%
Weka Overlay Predictor MLP 15.1973 81.8412 11.17%
LR 2.6923 7.1430 15.97%
SVR 4.8870 14.3723 14.48%
Table 27: Minimum price prediction results for daily absolute price deltas.
J Detailed Reseller Price Prediction Results
Table 28 provides detailed information about the reseller price prediction results within the car product category. The underlying crawling interval is on a daily basis.
Delta Type Predictor MAE RMSE PPR Grid Search Config
Absolute Delta
No Delta Predictor 0.7748 5.9499 100.00% default
Decision Tree Predictor 1.1023 7.6885 75.73% rfInstances=2;rfAttributes=0;m5Instances=2
Assortment Decision Tree Predictor 1.0279 8.0185 83.68% rfInstances=80;rfAttributes=5;m5Instances=2
Pretrained Decision Tree Predictor 0.7818 5.2951 90.92% rfInstances=10;rfAttributes=20;m5Instances=20
Direction Delta
No Delta Predictor 0.3161 0.5547 100.00% default
Decision Tree Predictor 0.3758 0.6822 83.68% rfInstances=50;rfAttributes=20
Assortment Decision Tree Predictor 0.3758 0.6822 83.68% rfInstances=50;rfAttributes=20
Pretrained Decision Tree Predictor 0.3165 0.567 98.03% rfInstances=80;rfAttributes=5
Simple Delta
No Delta Predictor 0.3161 0.5547 100.00% default
Decision Tree Predictor 0.2690 0.5146 68.63% rfInstances=10;rfAttributes=5
Assortment Decision Tree Predictor 0.2690 0.5142 69.31% rfInstances=2;rfAttributes=5
Pretrained Decision Tree Predictor 0.2788 0.5236 87.01% rfInstances=20;rfAttributes=0
Table 28: Reseller price prediction results for the car product category.
J Detailed Reseller Price Prediction Results 97