Ensemble Modeling Assignment 3 - Data Analytics

Syam Murali (A0134602U)
Arvind Kozhiyalam (A0134599N)
Upma Vermani (A0134605M)
Arun Sankar (A0134606X)




Executive Summary

Overview

Currently, the business places its orders for the next day based on the previous day's demand. However, several factors affect bike rental demand, and it is essential to consider them while forecasting. The proposed model takes these factors into account when generating forecasts.

Comparison of Model Forecasts for July 2012 – Dec 2012

Model             Profit ($)    % increase in profit over current model
Current Model     794,128       0
Proposed Model    1,004,036     26.4%

Through the implementation of the proposed model, the business stands to gain an additional profit of 26.4% ($209,908).

The proposed model is an ensemble of linear regression models built on 18 months of data (Jan 2011 to Jun 2012). The profit projections are tested on data from July 2012 to Dec 2012.

Recommendations

1. On weekdays, demand is high between 7am - 9am and 5pm - 7pm, moderate in the afternoon and negligible at night. The business should ensure that sufficient bikes are available during peak hours, based on the model's hourly predictions.

2. Bike demand is highly seasonal, with the winter season having the highest overall demand.

3. Registered users are the primary business drivers; they rent regularly to commute to work and back. Casual users show occasional high demand when weather conditions are suitable and on holidays and weekends. Because the two groups use bikes very differently, the business should forecast demand for each group separately.

4. Demand for casual bike rentals is mainly affected by three factors: temperature, weekend and humidity. Casual demand increases sharply on weekends, and the business should account for this increase.

5. Demand from registered users is highest during the winter months. It also depends on the day of the week and decreases on weekends, so the business can cut costs by not overstocking on weekends (based on the model's predictions).

6. Higher temperatures increase demand, whereas higher humidity lowers it. Bikers prefer warm days that are not humid.

7. The proposed model's performance declines with age. The business should therefore retrain the model frequently, preferably every month.

8. Data covering multiple years would help generate better models.


Data Cleaning and Exploratory data analysis

Data Description

The data for analysis is from a two-year log of bikes being rented in a bike sharing system in

Washington, D.C., USA, known as Capital Bike Sharing (CBS).

Hourly data was considered for the analysis. The data set has a total of 17,379 hourly observations.

List of Variables

S.No.  Variable    Description                                                       Variable Type
1      instant     Record index
2      dteday      Date                                                              Continuous
3      season      Season                                                            Categorical
4      yr          Year                                                              Categorical
5      mnth        Month                                                             Categorical
6      hr          Hour                                                              Categorical
7      holiday     Whether day is holiday or not                                     Flag
8      weekday     Day of the week                                                   Categorical
9      workingday  1 if day is neither weekend nor holiday, otherwise 0              Flag
10     weathersit  Weather situation (codes 1 to 4, listed below)                    Categorical
11     temp        Normalized temperature in Celsius                                 Continuous
12     atemp       Normalized feeling temperature in Celsius                         Continuous
13     hum         Normalized humidity                                               Continuous
14     windspeed   Normalized wind speed                                             Continuous
15     casual      Count of casual users                                             Continuous
16     registered  Count of registered users                                         Continuous
17     cnt         Count of total rental bikes including both casual and registered  Continuous

weathersit codes:
1. Clear, Few clouds, Partly cloudy
2. Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3. Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4. Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

Data Cleaning

Outliers

The predictor variables were inspected for unusual values. Boxplots of the predictor variables indicated that all values were within an acceptable range.

On inspecting the boxplot for the variable "cnt" (total count), some extreme values were noticed.

[Figures: Total Bike Rentals, Casual Bike Rentals, Registered Bike Rentals]

However, the outliers do not seem to be very extreme. Moreover, on aggregating the hourly data to the daily level, the values are at an acceptable level.

Since the values are within acceptable limits, all data points are retained for analysis.

Data Partitioning

For the purpose of this analysis, the data set is partitioned into two sections. The training set comprises the hourly data from 2011 and the testing set contains the hourly data for 2012.

The training set comprises 8,645 observations, whereas the testing set has 8,734 observations (all observations are at the hourly level).


Missing Values and Incorrect Observations

On inspection for missing values, it was noticed that the training data set had 115 missing hourly observations.

The table below shows one such instance. There is a time difference of 13 hours between the successive observations (23:00 on 17 January to 12:00 on 18 January), so 12 hourly records are missing.

Record Date Hour

396 17-01-2011 23

397 18-01-2011 12

Missing observations can affect the performance of time series based predictions.
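The gap check described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the analysis: the function name `find_gaps` is hypothetical, and the two outer timestamps are made-up neighbours added around records 396 and 397.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, step=timedelta(hours=1)):
    """Return (previous, current, n_missing) for every gap in a sorted list of timestamps."""
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = curr - prev
        if delta > step:
            # Number of expected records that are absent between prev and curr.
            n_missing = int(delta / step) - 1
            gaps.append((prev, curr, n_missing))
    return gaps

# Records 396 and 397 from the table, plus one hypothetical neighbour on each side.
stamps = [
    datetime(2011, 1, 17, 22),
    datetime(2011, 1, 17, 23),  # record 396
    datetime(2011, 1, 18, 12),  # record 397
    datetime(2011, 1, 18, 13),
]
gaps = find_gaps(stamps)
for prev, curr, n_missing in gaps:
    print(prev, "->", curr, "missing hourly records:", n_missing)
```

Summing `n_missing` over all gaps in a year of hourly data gives the total number of missing observations.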

Certain anomalies were detected in the temperature distribution.

Temperature Distribution

The above graph shows the effect of wind chill below a temperature of about 0.3 and the effect of humidity above about 0.48.

Outliers: There are outliers to the lower right of the main grouping. On analysis, these 24 points all occur on a single day in August. It is safe to assume that there was an error in capturing the data for that day.

Exploratory Data Analysis

Variable Correlations with Cnt ( Total bike rentals )

Variable correlations were computed to understand the relationship cnt (Total number of bikes rented)

has with the rest of the variables.

Variable      Correlation with cnt
season         0.22
mnth           0.18
hr             0.41
holiday       -0.02
weekday        0.00
workingday     0.01
weathersit    -0.14
temp           0.45
atemp          0.45
hum           -0.29
windspeed      0.09


The total number of bikes rented shows the highest correlations with temperature (0.45) and hour of the day (0.41). Higher temperatures seem to drive up the number of bike rentals. Total rentals are negatively correlated with humidity (-0.29); higher humidity appears to reduce rentals.

Factors such as holiday and working day, on the other hand, do not seem to have much effect on total bike rentals.
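For reference, correlations of this kind can be computed as below. This is a minimal sketch assuming plain Pearson correlation; the report does not state which coefficient was used, and the toy values are made up purely for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy observations (not taken from the data set).
temp = [0.2, 0.4, 0.6, 0.8]
hum = [0.9, 0.7, 0.6, 0.3]
cnt = [40, 90, 150, 210]

r_temp = pearson(temp, cnt)  # positive: rentals rise with temperature
r_hum = pearson(hum, cnt)    # negative: rentals fall with humidity
print(round(r_temp, 2), round(r_hum, 2))
```

Applying `pearson` to each predictor column against cnt, casual or registered in turn would produce tables like the ones in this section.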

Since the total bike rentals is a sum of casual rentals and rentals by registered users, we look into

whether the above relationships hold true for casual and registered rentals as well.

Variable Correlations with Casual

Variable      Correlation with casual
season         0.14
mnth           0.09
hr             0.30
holiday        0.05
weekday       -0.01
workingday    -0.32
weathersit    -0.16
temp           0.48
atemp          0.47
hum           -0.31
windspeed      0.07

Similar relationships exist for temperature, humidity and hour of the day. However, working day is also a factor for casual rentals (-0.32), with fewer rentals occurring on working days.

Variable Correlations with Registered

Variable      Correlation with registered
season         0.22
mnth           0.19
hr             0.39
holiday       -0.05
weekday        0.00
workingday     0.13
weathersit    -0.12
temp           0.38
atemp          0.38
hum           -0.24
windspeed      0.08

Registered bike rentals and Total bike rentals share a similar relationship with the variables.

Temperature, time of the day and humidity seem to be major factors.

Correlations between Variables

The correlations between variables are examined to identify variables that are highly correlated with each other. Highly correlated predictors make regression coefficient estimates unstable and can degrade prediction results.

temp (temperature) and atemp (feels-like temperature) are highly correlated (0.99). We remove atemp, since both variables convey almost the same information.


Season and Month are also highly correlated. The variables are retained and will be examined during

modeling to understand their impact on the model's VIF.

Impact of variables on Bike Rentals

Season: It is important to understand the importance of Season on Bike rentals, as climatic conditions

greatly affect the ridership.

Effect of Season on Ridership

The above chart shows that in the summer months the difference between the number of registered and casual riders on a given day is small. Conversely, in the colder months the difference is large; there are more days with large differences. This effect is strong enough to keep season in the model.

Continuing in this manner, the strongest indicators of the difference in the number of registered versus casual riders on a given day turn out to be:

1. Season

2. Working day (i.e. neither a holiday nor a weekend)

3. Weather (rain, sun, cloud, snow)

4. Year of the bike share program

This implies that registered riders are more likely than casual riders to ride in the winter, on a working day and in worse weather.


Day of Week and Weather

The hourly demand for bikes shows two peaks: one at 8 in the morning and another at 5 in the evening. The demand is steady around noon and declines slowly after 5 pm.

The same pattern is seen for registered rentals. This can be attributed to registered users using bikes for their daily commute to and from work.

Casual bike users show a clear difference in this regard. The demand for casual bike rentals increases slowly during the day and peaks at around 5 pm. There is little demand from casual users during the morning hours.

Weather does not greatly affect the shape of the demand curve with respect to time of day, only its magnitude and density.

The shape of the demand curve differs distinctly between working and non-working days. The two spikes in usage on weekdays suggest travel to work and back home.

Thus, there is a clear difference between the two groups, Casual and Registered, and the two groups will be analyzed separately.

Wind speed

Wind speed does not seem to have a significant effect on the number of bike rentals. There are a few notable outliers for casual bike rentals in this regard.

Humidity

There is a drop in the number of rentals with an increase in humidity. This could be because humid conditions are generally not conducive to biking.

New Variables

To aid in predictive modeling, a few trend variables were created.

trendYesterday: This variable contains the value for the previous day's demand.

trendWeek: This variable is a moving average of the number of rentals for the past 7 days.

trendMonth: This variable is a moving average of the number of rentals for the past 30 days.

trendPrevW: This variable contains the number of rentals on the same day of the previous week. (If the current day is a Monday, the trend variable equals the number of rentals on the previous Monday.)


Since Casual and Registered users are quite distinct in their rental behavior, the trend variables were

created separately for the two groups.

e.g. trendYesterday_C is the trend variable for casual rentals

trendYesterday_R is the similar variable for registered rentals
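The trend variables can be sketched as below. This is a minimal illustration under stated assumptions: the variables are computed from a list of daily rental totals for one user group, the helper name `trend_features` is hypothetical, and days without enough history get `None`. The report does not say how the first days of the series were handled.

```python
def trend_features(daily, day):
    """Trend variables for position `day` in a list of daily rental totals.

    Returns None for a variable when there is not enough history.
    """
    def mean_last(n):
        if day < n:
            return None
        window = daily[day - n:day]  # the n days before `day`
        return sum(window) / n

    return {
        "trendYesterday": daily[day - 1] if day >= 1 else None,
        "trendWeek": mean_last(7),    # moving average over the past 7 days
        "trendMonth": mean_last(30),  # moving average over the past 30 days
        "trendPrevW": daily[day - 7] if day >= 7 else None,  # same weekday, one week earlier
    }

# Hypothetical daily totals: 100, 101, ..., 139.
daily = list(range(100, 140))
features = trend_features(daily, 35)
print(features)
```

Running the same function once on casual totals and once on registered totals would give the `_C` and `_R` versions of each variable.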

Modeling

Variables

Model Targets

Separate models were generated for Casual and Registered Rentals.

Casual rentals models Target = Casual (count of casual users)

Registered rental models Target = Registered (count of registered users)

Model Input variables

Casual Rentals Model              Registered Rentals Model
Variable          Type            Variable          Type
season            Categorical     season            Categorical
mnth              Categorical     mnth              Categorical
hr                Categorical     hr                Categorical
holiday           Categorical     holiday           Categorical
weekday           Categorical     weekday           Categorical
workingday        Flag            workingday        Flag
weathersit        Categorical     weathersit        Categorical
temp              Continuous      temp              Continuous
hum               Continuous      hum               Continuous
windspeed         Continuous      windspeed         Continuous
trendYesterday_C  Continuous      trendYesterday_R  Continuous
trendWeek_C       Continuous      trendWeek_R       Continuous
trendMonth_C      Continuous      trendMonth_R      Continuous
trendPrevW_C      Continuous      trendPrevW_R      Continuous

Training and testing set

Train set: Jan 2011 - Dec 2011

Test set: Jan 2012 - Dec 2012

Model Building and Testing

Two separate models were generated for Casual and Registered Rentals, and their predictions were

combined to get the total rental prediction.


Different models were generated and their performance benchmarked to select the best model. The

following accuracy measures were considered for benchmarking.

1. RMSE : Root Mean Squared Error

2. MAD : Mean Absolute Deviation
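The two accuracy measures can be written out as follows. This sketch assumes that MAD here means the mean absolute difference between actual and predicted values; the toy numbers are made up.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mad(actual, predicted):
    """Mean absolute deviation of the predictions from the actual values."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Made-up daily totals for illustration.
actual = [100, 200, 300, 400]
predicted = [110, 190, 330, 400]

print(rmse(actual, predicted))  # square root of the mean of 100, 100, 900, 0
print(mad(actual, predicted))   # mean of 10, 10, 30, 0
```

RMSE penalizes large errors more heavily than MAD, which is why the two measures can rank models differently.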

Model 0: Naive Model

This is a random walk model where the previous day's demand is the prediction of the next day. This

model is used for benchmarking the other models. The hourly predictions from the model were

aggregated to daily level for benchmarking and profit calculations.

Benchmarking was done on the test data set.
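A minimal sketch of the naive benchmark is given below. It assumes complete days and ISO-formatted date strings, and it aggregates to daily totals before shifting by one day; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict

def aggregate_daily(hourly_rows):
    """Sum hourly counts into daily totals. hourly_rows: iterable of (date_string, count)."""
    totals = defaultdict(int)
    for day, count in hourly_rows:
        totals[day] += count
    return dict(totals)

def naive_forecast(daily_totals):
    """Predict each day's demand as the previous day's actual demand."""
    days = sorted(daily_totals)  # ISO date strings sort chronologically
    return {today: daily_totals[yesterday] for yesterday, today in zip(days, days[1:])}

# Made-up hourly rows for three days.
rows = [
    ("2012-01-01", 10), ("2012-01-01", 20),
    ("2012-01-02", 5), ("2012-01-02", 15),
    ("2012-01-03", 40),
]
daily = aggregate_daily(rows)
forecast = naive_forecast(daily)
print(forecast)
```

The first day has no forecast because there is no previous day to copy from.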

Accuracy of Naive Model

Measure Value

RMSE 1513.34

MAD 1112.26

Profit $1,441,205

Model 1: Linear Regression

Linear regression was used to predict the number of bike rentals. Stepwise variable selection was used to add and remove variables from the model.

Model for Casual Rentals

Final Model Coefficients:

                           Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)                -3.23076   0.791070    -4.084   4.47e-05 ***
seasonsummer               0.991416   0.335680    2.953    0.003151 **
seasonwinter               1.035420   0.342314    3.025    0.002496 **
hr                         0.083244   0.022956    3.626    0.000289 ***
holidayTRUE                3.329527   0.872225    3.817    0.000136 ***
weekdayTuesday             1.862357   0.552811    3.369    0.000758 ***
weekdayWednesday           2.312750   0.576970    4.008    6.16e-05 ***
weekdayThursday            2.414725   0.579703    4.165    3.14e-05 ***
weekdayFriday              3.880203   0.568145    6.830    9.09e-12 ***
weekdaySaturday            9.053754   0.574337    15.764   < 2e-16 ***
weekdaySunday              6.480375   0.542391    11.948   < 2e-16 ***
`weathersitlight weather`  -2.765033  0.502845    -5.499   3.93e-08 ***
temp                       11.566455  1.036110    11.163   < 2e-16 ***
hum                        -5.813031  0.810490    -7.172   7.99e-13 ***
trendYesterday_C           0.970608   0.009976    97.299   < 2e-16 ***
trendWeek_C                -0.217633  0.016549    -13.150  < 2e-16 ***
trendMonth_C               0.204651   0.013894    14.729   < 2e-16 ***
trendPrevW_C               -0.087799  0.009934    -8.839   < 2e-16 ***

Model Accuracy:

Residual standard error: 12.5 on 8627 degrees of freedom
Multiple R-squared: 0.8966, Adjusted R-squared: 0.8964
F-statistic: 4401 on 17 and 8627 DF, p-value: < 2.2e-16

Residuals:

Min       1Q      Median  3Q     Max
-114.943  -5.695  -0.286  4.535  93.291

Inference:

The model has a high adjusted R-squared value of 0.896, so it fits the casual rentals well on the training data.

Demand for casual bike rentals is mainly affected by three factors: temperature, weekend and humidity. Higher temperatures increase demand, whereas higher humidity lowers it. Demand for casual rentals increases sharply on weekends.

Model for Registered Rentals

Stepwise variable selection was employed to fine-tune the model.

Final Model Coefficients:

                           Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)                3.010317    3.288333    0.915    0.360
seasonsummer               7.881060    1.626050    4.847    1.28e-06 ***
seasonwinter               16.270329   2.382837    6.828    9.18e-12 ***
mnth                       0.768124    0.324656    2.366    0.018 *
hr                         2.878827    0.124998    23.031   < 2e-16 ***
holidayTRUE                -19.045360  3.930302    -4.846   1.28e-06 ***
weekdaySaturday            -16.954830  1.830363    -9.263   < 2e-16 ***
weekdaySunday              -13.558294  1.901725    -7.129   1.09e-12 ***
`weathersitlight weather`  -19.623097  2.341296    -8.381   < 2e-16 ***
temp                       78.759302   5.144246    15.310   < 2e-16 ***
hum                        -48.388516  3.669374    -13.187  < 2e-16 ***
trendYesterday_R           0.789897    0.008391    94.136   < 2e-16 ***
trendWeek_R                -0.662315   0.018233    -36.326  < 2e-16 ***
trendMonth_R               0.410428    0.030785    13.332   < 2e-16 ***
trendPrevW_R               0.045168    0.008667    5.212    1.91e-07 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Accuracy:

Residual standard error: 58.24 on 8630 degrees of freedom
Multiple R-squared: 0.7174, Adjusted R-squared: 0.7169
F-statistic: 1565 on 14 and 8630 DF, p-value: < 2.2e-16

Residuals:

Min      1Q      Median  3Q     Max
-274.92  -30.33  -4.91   21.03  340.44

Inference:

The model also has a good adjusted R-squared value of 0.72, although this is lower than that of the model for casual rentals.

Demand from registered users is highest during the winter months. Demand also depends on the day of the week, decreasing on weekends. An increase in temperature drives demand up, whereas increased humidity drives it down.

Linear Regression Model – Accuracy

To test the overall model's accuracy, the casual and registered models were run on the test data and their predictions were summed. The hourly predictions were then aggregated to daily values.

The model accuracy for daily predictions:

Measure Value

RMSE 852.26

MAD 758.32

Profit $1,762,746.16

The RMSE for this model is significantly better than the Naive Model.

There is also a significant increase in profit.

Improvement over Naive Model: $321,541.16

Model 2: Random Forest

Model for Casual Rentals

Resampling: Cross-Validated (5 fold)

Summary of sample sizes: 6915, 6917, 6916, 6916, 6916

On tuning the model with different values of mtry (number of variables sampled):

mtry  RMSE
2     13.57467
13    10.65070
24    10.85156

RMSE was used to select the optimal model using the smallest value.

The final value used for the model was mtry = 13.

Model for Registered Rentals

Resampling: Cross-Validated (5 fold)

Summary of sample sizes: 6916, 6916, 6916, 6917, 6915

On tuning the model with different values of mtry (number of variables sampled):

mtry RMSE

2 45.94593

13 22.93559

24 22.72873

RMSE was used to select the optimal model using the smallest value. The final value used for the model

was mtry = 24.
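The selection rule amounts to picking the mtry value with the smallest cross-validated RMSE. The sketch below applies it to the values reported in the two tuning tables; the helper name `best_mtry` is hypothetical, and the tuning itself was done by the modelling tool, not by this code.

```python
# Cross-validated RMSE per mtry value, copied from the tuning tables.
casual_cv = {2: 13.57467, 13: 10.65070, 24: 10.85156}
registered_cv = {2: 45.94593, 13: 22.93559, 24: 22.72873}

def best_mtry(cv_rmse):
    """Return the mtry value with the smallest cross-validated RMSE."""
    return min(cv_rmse, key=cv_rmse.get)

print("casual:", best_mtry(casual_cv))
print("registered:", best_mtry(registered_cv))
```

For the registered model the difference between mtry = 13 and mtry = 24 is small, so the choice between them has little effect on accuracy.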

The hourly predictions were further aggregated to daily values.

Random Forest Model – Accuracy

The model accuracy for daily predictions:

Measure Value

RMSE 932.92

MAD 781.48

Profit $1,760,181.58

The RMSE for this model is significantly better than the Naive Model.

There is also a significant increase in profit.

Improvement over Naive Model: $318,976.57

Model Comparison

The models are compared to identify the best performing model.

Model              RMSE     MAD      Profit ($)
Naïve Model        1513.34  1112.26  1,441,205
Linear Regression  852.26   758.32   1,762,746
Random Forest      932.92   781.48   1,760,182

Linear Regression outperforms the other models: its RMSE and MAD are both lower, and its profit is the highest.

Ensembles

Linear regression provides the best results on the test set, so an ensemble of linear models was generated to improve prediction accuracy.

To create sufficient variance in the predictions, each model was built on a random sample of 80% of the training data. Ten different models were created in this way.

The models were then evaluated on the test set to identify the best performers.

Performance of different Linear Regression Models considered for Ensemble

Model RMSE Profit

2 862.8418 1758522

3 840.572 1767477

4 903.3133 1744826

5 807.7037 1778867

6 837.5898 1768395

7 856.2363 1761868

8 863.7491 1759411

9 868.3082 1756820

10 836.6417 1769006

Models 3, 5, 6 and 10 have lower RMSE values than the other models.

The individual models may not be very stable, since they were generated from random samples of the data. These four models were therefore combined into an ensemble, and their predictions are averaged to get the final predictions.
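The sampling-and-averaging idea can be sketched as below. This is a simplified illustration, not the models used in the report: it fits a one-predictor least-squares line instead of the full multi-variable regression, averages all ten models instead of selecting the best ones on the test set, and uses synthetic data. All function names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_ensemble(xs, ys, n_models=10, fraction=0.8, seed=0):
    """Fit n_models lines, each on a random sample (without replacement) of the data."""
    rng = random.Random(seed)
    k = int(len(xs) * fraction)
    models = []
    for _ in range(n_models):
        idx = rng.sample(range(len(xs)), k)
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_ensemble(models, x):
    """Average the predictions of all models in the ensemble."""
    return sum(intercept + slope * x for intercept, slope in models) / len(models)

# Synthetic data: demand = 50 + 200 * temp + noise.
noise = random.Random(1)
temps = [i / 100 for i in range(100)]
demand = [50 + 200 * t + noise.gauss(0, 5) for t in temps]

models = fit_ensemble(temps, demand)
prediction = predict_ensemble(models, 0.5)
print(round(prediction, 1))
```

Averaging smooths out the instability of the individual models, since each one sees a slightly different sample of the training data.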

Ensemble Accuracy

Measure                                    Value
RMSE                                       830.27
Total Profit from Naive Predictions        $1,441,205
Total Profit from Model Predictions        $1,771,058
Improvement due to Model over Naive Model  $329,852.6

Model Comparison

Model              RMSE     Profit ($)
Naïve Model        1513.34  1,441,205
Linear Regression  852.26   1,762,746
Random Forest      932.92   1,760,182
Ensemble Model     830.27   1,771,058

The ensemble model has the lowest RMSE and the highest profit.

Profit

Profit Calculations

Profit per day = Revenue per day - Costs per day

Revenue per day = bikes rented out to customers * revenue per bike
                = min(actual demand, predicted demand) * revenue per bike

Costs per day = predicted demand * loan cost per bike

Revenue per bike = $3

Loan cost per bike = $2
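The profit calculation can be expressed directly in code. The sketch below follows the formulas above; the function names and the two-day example values are made up for illustration.

```python
REVENUE_PER_BIKE = 3.0
LOAN_COST_PER_BIKE = 2.0

def daily_profit(actual, predicted, revenue=REVENUE_PER_BIKE, cost=LOAN_COST_PER_BIKE):
    """Profit for one day: only min(actual, predicted) bikes can be rented out,
    but every predicted (ordered) bike incurs the loan cost."""
    rented = min(actual, predicted)
    return rented * revenue - predicted * cost

def total_profit(actuals, predictions, revenue=REVENUE_PER_BIKE, cost=LOAN_COST_PER_BIKE):
    """Sum of daily profits over the evaluation period."""
    return sum(daily_profit(a, p, revenue, cost) for a, p in zip(actuals, predictions))

# Made-up example: one over-forecast day and one under-forecast day.
actuals = [100, 120]
predictions = [110, 100]
profit = total_profit(actuals, predictions)
print(profit)
```

Over-forecasting costs $2 for every unused bike, while under-forecasting loses the $1 margin on every unmet rental, so the two kinds of error are not penalized equally.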

Ensemble Model Profit

Model profit for 2012 = $1,771,058

Model profit as a percentage of total costs for 2012 = 49.2%

Naïve Model Profit

Model profit for 2012 = $1,441,205

Model profit as a percentage of total costs for 2012 = 35.2%

Ensemble Model's Performance with Revenue per rental

The ensemble model outperforms the naïve model only while revenue per rental is not too high relative to the cost.

The model performance was evaluated for different revenue values, keeping the cost constant at $2.

The ensemble model's improvement over the naïve model shrinks as revenue per rental increases. Beyond the break-even point of roughly $9.24 per bike, the naïve model performs better. This is because the importance of cost diminishes at high revenue levels.

Revenue per bike ($)  Model profit ($)  Naïve profit ($)  Improvement ($)
2.1                   159,302           -218,187.10       377,489
2.2                   338,386           -33,810.20        372,196
2.3                   517,470           150,566.70        366,903
2.4                   696,554           334,943.60        361,610
2.5                   875,638           519,320.50        356,317
2.6                   1,054,722         703,697.40        351,024
2.7                   1,233,806         888,074.30        345,731
2.8                   1,412,890         1,072,451.20      340,438
2.9                   1,591,974         1,256,828.10      335,145
3.0                   1,771,058         1,441,205         329,853
3.1                   1,950,142         1,625,581.90      324,560
3.2                   2,129,225         1,809,958.80      319,267
3.3                   2,308,309         1,994,335.70      313,974
3.4                   2,487,393         2,178,712.60      308,681
3.5                   2,666,477         2,363,089.50      303,388
4.0                   3,561,897         3,284,974         276,923

Both models yield the same profit when revenue = $9.24 per bike.
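The break-even point can be cross-checked from the table. With predictions and cost held fixed, profit is a straight line in revenue per bike, so two rows per model determine each line. The sketch below uses the rows for revenue 3.0 and 3.5; the helper name `line_through` is hypothetical.

```python
def line_through(p1, p2):
    """Slope and intercept of the straight line through two (revenue, profit) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

# (revenue per bike, profit) pairs taken from the table rows for 3.0 and 3.5.
model_slope, model_intercept = line_through((3.0, 1771058), (3.5, 2666477))
naive_slope, naive_intercept = line_through((3.0, 1441205), (3.5, 2363089.50))

# Revenue at which the two profit lines cross.
break_even = (model_intercept - naive_intercept) / (naive_slope - model_slope)
print(round(break_even, 2))
```

The naïve line is steeper because its larger orders capture more of the actual demand, which is why it overtakes the ensemble once revenue per bike is high enough.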

Ensemble Model's Performance with Season and Months

To understand whether model performance depends on season or month, the RMSE value was computed for each season and month.

RMSE

Spring 769.7461478

Summer 781.1242547

Fall 955.2417344

Winter 796.5976288

The RMSE is highest for fall, i.e. the model performs worse in fall than in the other seasons.

Month      RMSE
January    682.0312805
February   670.8317117
March      923.6624846
April      791.4443095
May        816.0132923
June       746.1080826
July       767.1820004
August     906.9803393
September  1128.546405
October    1000.305523
November   732.9792307
December   651.5160767

Similarly, the RMSE is higher, i.e. the model performs worse, for August, September, October and March.
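A per-season or per-month RMSE of this kind can be computed by grouping the daily errors before averaging. The sketch below is a minimal illustration with made-up records; the function name `rmse_by_group` is hypothetical.

```python
import math
from collections import defaultdict

def rmse_by_group(records):
    """RMSE per group. records: iterable of (group, actual, predicted)."""
    squared_errors = defaultdict(list)
    for group, actual, predicted in records:
        squared_errors[group].append((actual - predicted) ** 2)
    return {g: math.sqrt(sum(v) / len(v)) for g, v in squared_errors.items()}

# Made-up daily records: (season, actual demand, predicted demand).
records = [
    ("Spring", 100, 90),
    ("Spring", 200, 210),
    ("Fall", 300, 340),
    ("Fall", 250, 220),
]
result = rmse_by_group(records)
print(result)
```

Using the month name as the group key instead of the season gives the monthly breakdown.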

Ensemble Model’s Performance with Aging

The model performance was determined for each day to determine the performance change with aging

of the model. It was observed that the model performance declines slowly as the model ages.

Ensemble model with 18 months data

The models were rebuilt using 18 months of training data. Since profit is then calculated over only six months and is not comparable with the 12-month figures, RMSE is used to compare model accuracy.

Model  RMSE      Profit
1      679.5514  999,557.1
2      657.0268  1,758,522
3      675.3358  1,000,259
4      660.0453  1,002,866
5      652.6955  1,004,317
6      644.9173  1,004,920
7      678.2894  998,691.5
8      664.2854  1,002,217
9      673.5327  1,001,058
10     696.3528  996,825.5

All the models have significantly better RMSE (lower value) when compared to the model built with 12

months data.

For ensemble, we consider models 2, 4, 5 and 6 since they have the smallest RMSE values.

[Figure: RMSE by day. x-axis: Day (1 to 361); y-axis: RMSE (0 to 900)]

Accuracy of the Final Ensemble

RMSE: 653.1097

Total Profit from Naive Predictions: $794,128

Total Profit from Model Predictions: $1,004,036

Improvement due to Model over Naive Model: $209,908.3

Ensemble of models created with 18 months of data clearly outperforms an ensemble created with

12 months of training data.

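Data balancing

The 18-month training window covers January to June twice but July to December only once. The data was therefore balanced by giving observations from the first six months of the year half the probability of being sampled compared to those from the last six months.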

The model was rebuilt using the balanced data.
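The balancing step can be sketched as weighted sampling. This is an illustration under stated assumptions, not the procedure used in the report: it samples with replacement via `random.choices`, and the rows are synthetic placeholders for an 18-month window with equal numbers of records per month.

```python
import random

def balanced_sample(rows, k, seed=0):
    """Sample k rows with replacement; Jan-Jun rows get half the weight of Jul-Dec rows.

    rows: list of (year, month) tuples standing in for training records.
    """
    rng = random.Random(seed)
    weights = [0.5 if month <= 6 else 1.0 for _, month in rows]
    return rng.choices(rows, weights=weights, k=k)

# Synthetic 18-month window: all of 2011 plus Jan-Jun 2012, 50 records per month.
rows = [(2011, m) for m in range(1, 13) for _ in range(50)]
rows += [(2012, m) for m in range(1, 7) for _ in range(50)]

sample = balanced_sample(rows, 6000)
first_half_share = sum(1 for _, month in sample if month <= 6) / len(sample)
print(round(first_half_share, 2))
```

In the unweighted window, January to June accounts for two thirds of the records; with the halved weights the two halves of the year are sampled about equally often.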

Final accuracy and profit of the model after fine tuning the ensembles.

RMSE: 667.2481

Total Profit from Naive Predictions: $ 794,128

Total Profit from Model Predictions: $1,001,679

Improvement due to Model over Naive Model: $207,550.6

The model without data balancing performs slightly better than the model with balanced data.

Since each model in the ensemble is already built on a random sample of the training set, balancing the data did not yield any further performance boost.