ACMS TV Ratings Midterm Angelini

Television Show Cancelation Model Purposeful Selection of Covariates

Brandon Angelini Topics in Statistics – Logistic Modeling

ACMS 40950 March 4th 2016

Introduction

The model’s goal is to predict whether a TV show will be canceled, using a variety of publicly available information about the show. The data set has a binary outcome (canceled or renewed); four continuous covariates (18-49 demographic viewership, previous-year 18-49 demographic viewership, overall viewership, and previous-year overall viewership); the categorical variable Network (ABC, CBS, NBC, etc.); and the binary variables Scripted and Broadcast. The data are an aggregation of Nielsen TV ratings, as archived by TVSeriesFinale.com and compiled by the researcher (Brandon Angelini).

Data Set Variables:

Title – Show title
Cancel – 1 if canceled, 0 if renewed/other
ID – Arbitrary numerical ID assigned to show
Demo – Ratings for adults 18-49 demographic
PrevDemo – Ratings for previous year for adults 18-49 demographic (0 for new shows)
Viewers – Overall viewership
PrevViewers – Overall viewership for previous year (0 for new shows)
Scripted – 0 if unscripted show (reality), 1 if scripted show (drama, comedy, etc.)
Broadcast – 0 if cable, 1 if broadcast network (ABC, CBS, FOX, and NBC typically have higher distribution because they’re broadcast networks)
Network – Network the show airs on, coded as:
    ABC – 1, CBS – 2, CW – 3, FOX – 4, Freeform – 5, FX – 6, MTV – 7, NBC – 8, SyFy – 9, TNT – 10

The canceling and renewing of TV shows is often treated as a black box in which TV executives choose which shows live and die for a variety of reasons. This model attempts to reduce cancelation decisions to ratings data and the categories a show occupies. Production companies, media companies, and advertisers could use the model to predict what decisions executives may make, and so to decide how to allocate resources among shows.

Purposeful Selection of Covariates

Step 1: Create a univariable logistic regression model for each covariate.

Covariate               Coeff.   Std. Err.   Odds Ratio   G         p
Demo                    -1.389   0.238       0.24935      46.57     0.00
PrevDemo                -2.592   0.429       0.07487      91.10     0.000
Viewers                 -0.293   0.052       0.74602      42.92     0.000
PrevViewers             -0.568   0.101       0.56666      87.32     0.00
Script                   1.999   0.612       7.3817       17.51     0.001
Broadcast               -0.013   0.310       0.98708      0.00179   0.966
as.factor(Network)2     -0.581   0.427       0.55934      6.89      0.174
as.factor(Network)3     -0.899   0.812       0.40698      6.89      0.268
as.factor(Network)4     -0.150   0.455       0.86071      6.89      0.742
as.factor(Network)5      0.112   0.905       1.1185       6.89      0.901
as.factor(Network)6     -0.398   0.709       0.67166      6.89      0.574
as.factor(Network)7      0.295   0.776       1.3431       6.89      0.704
as.factor(Network)8      0.208   0.373       1.2312       6.89      0.578
as.factor(Network)9     -0.111   0.647       0.89494      6.89      0.864
as.factor(Network)10     0.623   0.660       1.8645       6.89      0.345

G is the likelihood ratio statistic of each univariable model (null deviance minus residual deviance); for Network, the single value 6.89 applies to the nine design variables together. The p column is the Wald p-value of the listed coefficient.

All covariates except Broadcast appear to be significant at this stage, so a model will be fit with all variables except Broadcast.

Step 2: Fit a multivariable model that contains all covariates that are significant in univariable analysis at the 25% level.

Covariate               Coeff.   Std. Err.   Z        Pr(>|z|)
Demo                    -2.230   1.112       -2.006   0.04489 *
PrevDemo                -2.641   1.564       -1.689   0.09120 .
Viewers                 -0.771   0.302       -2.554   0.01064 *
PrevViewers              0.025   0.347        0.072   0.94257
Script                   2.646   0.836        3.163   0.00156 **
as.factor(Network)2      2.054   0.953        2.154   0.03121 *
as.factor(Network)3     -6.661   1.331       -5.004   5.62e-07 ***
as.factor(Network)4     -1.636   0.837       -1.954   0.05073 .
as.factor(Network)5     -6.038   1.4209      -4.250   2.14e-05 ***
as.factor(Network)6     -6.900   1.3068      -5.280   1.29e-07 ***
as.factor(Network)7     -6.615   1.3516      -4.894   9.89e-07 ***
as.factor(Network)8     -0.730   0.6582      -1.110   0.26698
as.factor(Network)9     -7.299   1.3013      -5.609   2.03e-08 ***
as.factor(Network)10    -4.744   1.212       -3.915   9.03e-05 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

PrevDemo, PrevViewers, and Network 8 don’t appear to be significant in the Wald tests above, so each is checked with a likelihood ratio test. The likelihood ratio p-value is .0923 for PrevDemo and .94275 for PrevViewers (both >.05), so both should be excluded from the model at this point. Excluding Network as a whole gives a p-value of 1.589e-12, so Network should be kept in the model.
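The likelihood ratio test used in Step 2 can be sketched as follows. This is a minimal illustration on simulated data, not the paper's data set: the data frame `sim` and its values are made up, and only the column names mirror the paper's. It compares a full and a reduced model by the drop in deviance and checks that `anova()` reports the same p-value.

```r
# Simulated stand-in data (assumption: values are invented for illustration)
set.seed(1)
n <- 300
sim <- data.frame(Demo = runif(n, 0.2, 4), PrevDemo = runif(n, 0, 4))
sim$Cancel <- rbinom(n, 1, plogis(1.5 - 1.2 * sim$Demo))

full    <- glm(Cancel ~ Demo + PrevDemo, family = binomial, data = sim)
reduced <- glm(Cancel ~ Demo, family = binomial, data = sim)

# G = drop in deviance when the term is added back; under the null it is
# approximately chi-square with df = number of parameters removed
G     <- deviance(reduced) - deviance(full)
df    <- df.residual(reduced) - df.residual(full)
p_lrt <- pchisq(G, df = df, lower.tail = FALSE)

# anova() carries out the same comparison
lrt_table <- anova(reduced, full, test = "Chisq")
```

For a factor such as Network, `df` would be the number of design variables dropped (nine here), so the whole factor is tested at once rather than level by level.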

The reduced model therefore omits the previous-year variables (insignificant in the multivariable model in Step 2) and Broadcast (insignificant in the univariable model in Step 1).

Reduced Model

Covariate               Coeff.    Std. Err.   Z        Pr(>|z|)
Demo                    -2.7532   0.8836      -3.116   0.00183 **
Viewers                 -0.8748   0.2209      -3.959   7.51e-05 ***
Script                   3.0433   0.7953       3.827   0.00013 ***
as.factor(Network)2      2.0340   0.8442       2.409   0.01598 *
as.factor(Network)3     -7.2073   1.2492      -5.769   7.95e-09 ***
as.factor(Network)4     -2.1870   0.7657      -2.856   0.00429 **
as.factor(Network)5     -6.7336   1.3414      -5.020   5.18e-07 ***
as.factor(Network)6     -7.3584   1.2303      -5.981   2.22e-09 ***
as.factor(Network)7     -7.0426   1.2802      -5.501   3.77e-08 ***
as.factor(Network)8     -0.9525   0.5869      -1.623   0.10460
as.factor(Network)9     -7.6447   1.2329      -6.201   5.63e-10 ***
as.factor(Network)10    -5.2617   1.1354      -4.634   3.58e-06 ***

Step 3: Check whether covariates removed from the model in Step 2 confound, or are needed to adjust, the effects of the covariates remaining in the model.

With the removal of PrevDemo, the coefficients for Demo (21%), Network 4 (33%), and Network 8 (29%) change by >20%, and the removal of PrevViewers changes the coefficients for Viewers (58%) and Script (24%) by >20%. Both PrevDemo and PrevViewers therefore act as adjusters and should be left in the model. It also makes intuitive sense that a show may be judged not only on the current year’s performance but also on the previous year’s, so both will be kept in the model.

Step 4: Reconsider covariates that were excluded in Step 1.

In univariable analysis (Step 1) the Broadcast covariate was deemed insignificant (p=0.966). Adding Broadcast to the current model results in a small change to some of the Network coefficients. However, Broadcast is a classification of the networks themselves (ABC, CBS, FOX, and NBC are ‘broadcast channels’; the others are not), so its effects are likely already captured by the Network categorical variable. Broadcast will be left out of the model.
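The percent-change check in Step 3 can be sketched as below. This is an illustration under stated assumptions: the data are simulated (the data frame `sim` is hypothetical), and the percent change is taken relative to the coefficient from the larger model, which is the usual convention for this check.

```r
# Simulated stand-in data (assumption: values are invented for illustration)
set.seed(2)
n <- 300
sim <- data.frame(Demo = runif(n, 0.2, 4))
sim$PrevDemo <- 0.8 * sim$Demo + rnorm(n, sd = 0.4)   # correlated with Demo
sim$Cancel <- rbinom(n, 1, plogis(2 - 0.8 * sim$Demo - 0.8 * sim$PrevDemo))

with_prev    <- glm(Cancel ~ Demo + PrevDemo, family = binomial, data = sim)
without_prev <- glm(Cancel ~ Demo, family = binomial, data = sim)

# Coefficients present in both models, excluding the intercept
shared <- setdiff(intersect(names(coef(with_prev)), names(coef(without_prev))),
                  "(Intercept)")

# Percent change in each shared coefficient when PrevDemo is removed
pct_change <- 100 * (coef(without_prev)[shared] - coef(with_prev)[shared]) /
  coef(with_prev)[shared]

# Flag coefficients that move by more than 20%
needs_adjuster <- abs(pct_change) > 20
```

Matching coefficients by name rather than by position avoids misalignment when the two models contain different numbers of terms.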
Preliminary Main Effects Model:

Covariate               Coeff.   Std. Err.   Z        Pr(>|z|)
Demo                    -2.230   1.112       -2.006   0.04489 *
PrevDemo                -2.641   1.564       -1.689   0.09120 .
Viewers                 -0.771   0.302       -2.554   0.01064 *
PrevViewers              0.025   0.347        0.072   0.94257
Script                   2.646   0.836        3.163   0.00156 **
as.factor(Network)2      2.054   0.953        2.154   0.03121 *
as.factor(Network)3     -6.661   1.331       -5.004   5.62e-07 ***
as.factor(Network)4     -1.636   0.837       -1.954   0.05073 .
as.factor(Network)5     -6.038   1.4209      -4.250   2.14e-05 ***
as.factor(Network)6     -6.900   1.3068      -5.280   1.29e-07 ***
as.factor(Network)7     -6.615   1.3516      -4.894   9.89e-07 ***
as.factor(Network)8     -0.730   0.6582      -1.110   0.26698
as.factor(Network)9     -7.299   1.3013      -5.609   2.03e-08 ***
as.factor(Network)10    -4.744   1.212       -3.915   9.03e-05 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Step 5: Check the linearity of the remaining continuous covariates (Demo, PrevDemo, Viewers, and PrevViewers) through the use of lowess plots, and account for any nonlinearity.

[Figure: lowess plots of the logit against Demo, PrevDemo, Viewers, and PrevViewers]

The lowess plots for all 4 continuous covariates appear to be non-linear, which suggests using quartile design variables, fractional polynomials, or linear splines. Linear splines were tried first in an attempt to get the best possible fit, but the resulting model appeared to be overfit and yielded huge standard errors, so we’ll try fractional polynomials instead.
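For readers unfamiliar with linear splines, the sketch below shows one common way to build the basis. The helper `linear_spline_basis` is hypothetical: the paper's own spline helper is not shown, so its parameterization may differ. With three knots this construction yields four columns per covariate, labeled l1 to l4 here.

```r
# Hypothetical helper (assumption: not the paper's own spline function)
linear_spline_basis <- function(x, knots) {
  # Column 1 is x itself; each later column is (x - knot)+, which lets the
  # slope change at that knot while keeping the fitted line continuous
  basis <- cbind(x, sapply(knots, function(k) pmax(x - k, 0)))
  colnames(basis) <- paste0("l", seq_len(ncol(basis)))
  basis
}

# Small example with two knots
example_basis <- linear_spline_basis(c(0, 1, 3), knots = c(0.5, 2.2))
```

If few observations fall below the first knot or above the last one, the corresponding columns carry little information, which is one way spline terms can end up with unstable estimates.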

Rejected linear splines model

Covariate                  Estimate     Std. Err.   Z value      P value
Demo.splinesx.l1           -2.411e+15   9.919e+07   -24305729    <2e-16 ***
Demo.splinesx.l2           -5.640e+14   2.299e+07   -24526083    <2e-16 ***
Demo.splinesx.l3            2.843e+15   4.082e+07    69649357    <2e-16 ***
Demo.splinesx.l4           -2.774e+14   4.816e+07   -5760509     <2e-16 ***
PrevDemo.splinesx.l1        1.982e+15   7.734e+07    25626005    <2e-16 ***
PrevDemo.splinesx.l2       -8.001e+14   2.453e+07   -32612936    <2e-16 ***
PrevDemo.splinesx.l3       -9.628e+14   5.144e+07   -18715833    <2e-16 ***
PrevDemo.splinesx.l4       -9.876e+14   8.760e+07   -11274429    <2e-16 ***
Viewers.splinesx.l1         NA          NA           NA          NA
Viewers.splinesx.l2         1.401e+13   8.712e+06    1608118     <2e-16 ***
Viewers.splinesx.l3        -7.136e+14   5.992e+06   -119077111   <2e-16 ***
Viewers.splinesx.l4        -8.728e+14   2.426e+07   -35975439    <2e-16 ***
PrevViewers.splinesx.l1    -1.852e+15   1.407e+08   -13159211    <2e-16 ***
PrevViewers.splinesx.l2    -1.942e+14   7.531e+06   -25786173    <2e-16 ***
PrevViewers.splinesx.l3     1.676e+14   6.477e+06    25883316    <2e-16 ***
PrevViewers.splinesx.l4     1.424e+15   2.930e+07    48600026    <2e-16 ***
Script                      1.468e+15   1.268e+07    115751505   <2e-16 ***
as.factor(Network)2         9.463e+14   1.464e+07    64632365    <2e-16 ***
as.factor(Network)3        -2.331e+15   2.670e+07   -87309082    <2e-16 ***
as.factor(Network)4        -6.593e+14   1.634e+07   -40359985    <2e-16 ***
as.factor(Network)5        -2.173e+15   3.583e+07   -60636932    <2e-16 ***
as.factor(Network)6        -2.439e+15   2.918e+07   -83604516    <2e-16 ***
as.factor(Network)7        -3.225e+15   3.421e+07   -94264401    <2e-16 ***
as.factor(Network)8        -2.763e+14   1.215e+07   -22745070    <2e-16 ***
as.factor(Network)9        -2.602e+15   3.389e+07   -76767104    <2e-16 ***
as.factor(Network)10       -1.791e+15   2.764e+07   -64791116    <2e-16 ***

To determine the fractional polynomials, we’ll attempt a fit using the mfp package in R:

    fit1 <- mfp(Cancel ~ fp(Demo) + fp(PrevDemo) + fp(Viewers) + fp(PrevViewers) + Script + as.factor(Network), family=binomial, data=data1, verbose=T)

Upon trying fractional polynomials, none of the 4 variables yields a p-value that merits the addition of any degree of fractional polynomial, so by G-statistic/p-value testing the continuous covariates could remain linear. However, the plots do not appear linear, so we’ll try adding one-term fractional polynomials as a compromise between a more complex but statistically insignificant form and the linear one. The output of the fractional polynomial analysis leads us to conclude that the p1 values are Demo=0.5, PrevDemo=1, Viewers=1, and PrevViewers=-2.

Covariate               Estimate    Std. Error   z value   Pr(>|z|)
Demo                     7.4589     4.4412        1.679    0.093061 .
DemoSQRT               -23.7592     9.9072       -2.398    0.016477 *
PrevDemo                -0.1829     1.9413       -0.094    0.924937
Viewers                  2.3239     1.3430        1.730    0.083550 .
ViewersSQ               -0.2950     0.1372       -2.151    0.031488 *
PrevViewers             -0.5409     0.5159       -1.048    0.294423
Script                  19.1208     1137.0261     0.017    0.986583
as.factor(Network)2      3.5713     1.6306        2.190    0.028510 *
as.factor(Network)3     -5.5146     1.6626       -3.317    0.000911 ***
as.factor(Network)4     -1.2444     0.9568       -1.301    0.193417
as.factor(Network)5     -4.4888     1.7850       -2.515    0.011913 *
as.factor(Network)6     -5.7524     1.6927       -3.398    0.000678 ***
as.factor(Network)7     -4.9793     1.8046       -2.759    0.005793 **
as.factor(Network)8     -0.5257     0.6967       -0.755    0.450501
as.factor(Network)9     -7.1581     1.7966       -3.984    6.77e-05 ***
as.factor(Network)10    -5.2485     1.4327       -3.663    0.000249 ***

The introduction of fractional polynomials has changed the coefficients on Script and PrevDemo substantially and made them statistically insignificant, so at this point it seems reasonable to conclude that fractional polynomials overcomplicate the model. The linear splines model gave similarly unstable results. So while we’d like to account for the apparent non-linearity illustrated in the lowess plots, we cannot do so without severely altering the model’s coefficients. The fractional polynomial terms will be removed from the model, and the continuous covariates will be assumed to be linear in the logit.

Step 6: Explore possible interactions among main effects.

15 models were fit, one for each pairwise interaction between covariates, and none was significant at the 5% level. The closest to significance were the interactions between Demo and Network (p = 0.067) and between PrevViewers and Network (p = 0.052), but neither makes sense in context. If demo viewership and network truly interacted, that should happen every year, yet we don’t see it for PrevDemo and Network; likewise, the PrevViewers and Network result is not mirrored by Viewers and Network. Given this intuition and the lack of significance at the 5% level, we won’t include any interactions in the model.
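One way to screen the pairwise interactions in Step 6 is to add each one to the main effects model in turn and test it with a likelihood ratio test. The sketch below uses simulated data with four covariates (the data frame `sim` and its values are assumptions made for illustration), which gives choose(4, 2) = 6 interactions; six covariates would give the 15 described above.

```r
# Simulated stand-in data (assumption: values are invented for illustration)
set.seed(3)
n <- 400
sim <- data.frame(Demo = runif(n, 0.2, 4),
                  Viewers = runif(n, 0.5, 15),
                  Script = rbinom(n, 1, 0.7),
                  Network = factor(sample(1:3, n, replace = TRUE)))
sim$Cancel <- rbinom(n, 1, plogis(1 - 0.6 * sim$Demo - 0.1 * sim$Viewers +
                                    0.8 * sim$Script))

main <- glm(Cancel ~ Demo + Viewers + Script + Network,
            family = binomial, data = sim)

# Every pair of covariates, one pair per column
pairs <- combn(c("Demo", "Viewers", "Script", "Network"), 2)

# Add each interaction to the main effects model and test it as a whole
p_values <- apply(pairs, 2, function(pair) {
  term   <- paste(pair, collapse = ":")
  bigger <- update(main, as.formula(paste(". ~ . +", term)))
  anova(main, bigger, test = "Chisq")$`Pr(>Chi)`[2]
})
names(p_values) <- apply(pairs, 2, paste, collapse = ":")
```

A likelihood ratio test covers all coefficients of an interaction at once, which matters for a factor like Network, where a single interaction adds one coefficient per non-reference level.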
Preliminary final model

Covariate               Coeff.   Std. Err.   Z        Pr(>|z|)
Demo                    -2.230   1.112       -2.006   0.04489 *
PrevDemo                -2.641   1.564       -1.689   0.09120 .
Viewers                 -0.771   0.302       -2.554   0.01064 *
PrevViewers              0.025   0.347        0.072   0.94257
Script                   2.646   0.836        3.163   0.00156 **
as.factor(Network)2      2.054   0.953        2.154   0.03121 *
as.factor(Network)3     -6.661   1.331       -5.004   5.62e-07 ***
as.factor(Network)4     -1.636   0.837       -1.954   0.05073 .
as.factor(Network)5     -6.038   1.4209      -4.250   2.14e-05 ***
as.factor(Network)6     -6.900   1.3068      -5.280   1.29e-07 ***
as.factor(Network)7     -6.615   1.3516      -4.894   9.89e-07 ***
as.factor(Network)8     -0.730   0.6582      -1.110   0.26698
as.factor(Network)9     -7.299   1.3013      -5.609   2.03e-08 ***
as.factor(Network)10    -4.744   1.212       -3.915   9.03e-05 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Diagnostics and Results

Hosmer-Lemeshow Statistic

A test of goodness of fit that compares the observed outcomes with the fitted values. The null hypothesis is that the model fits, the alternative is that the fit is poor, and the null is rejected when the p-value falls below a typical level (usually 5%).

H-L Values: X-squared = 4.2516, df = 8, p-value = 0.8337

We fail to reject the null and conclude that there is adequate agreement between the observed outcomes and the fitted values.

Cook’s Distances

A way to measure the effect of “outlier” data points on the coefficients. Cook’s distances are computed, the observation with the maximum distance is identified, and the model is refit without it to find the relative change in each coefficient. If removing a data point changes any coefficient by >20%, the point is removed and the model is fit again. Observation 115 has the maximum distance, and its removal changes no coefficient by the required 20%; the largest change is 3.93%, for PrevViewers.

K-Folds Validation

To evaluate the generalizability of the model we’ll test it on a variety of subsets of the data. To do so, we’ll start by splitting the data set into 5 subsets or ‘folds’, of sizes 56, 56, 56, 57, and 57. Each fold in turn serves as the validation set for a model fit to the other four. The mean error rate over the five folds is 0.3196115, which is acceptable for this type of model, although the error rate for fold 3 (0.6428571) is much higher than for the other folds.

Fold      Rows        Error rate
Fold 1    1:56        0.3035714
Fold 2    57:112      0.1428571
Fold 3    113:168     0.6428571
Fold 4    169:225     0.3333333
Fold 5    226:282     0.1754386
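The folds above are contiguous blocks of rows. If the rows happen to be ordered in some systematic way, contiguous folds can differ a lot from one another, so a common variation is to assign observations to folds at random. The sketch below illustrates that variation on simulated data; it is not the procedure used in this paper, and the data frame `sim` and its values are made up.

```r
# Simulated stand-in data (assumption: values are invented for illustration)
set.seed(4)
n <- 282
sim <- data.frame(Demo = runif(n, 0.2, 4), Viewers = runif(n, 0.5, 15))
sim$Cancel <- rbinom(n, 1, plogis(1.5 - 0.8 * sim$Demo - 0.1 * sim$Viewers))

k <- 5
# Shuffled fold labels; fold sizes differ by at most one
fold_id <- sample(rep(1:k, length.out = n))

err_rate <- sapply(1:k, function(f) {
  train <- sim[fold_id != f, ]
  valid <- sim[fold_id == f, ]
  fit   <- glm(Cancel ~ Demo + Viewers, family = binomial, data = train)
  p_hat <- predict(fit, newdata = valid, type = "response")
  # Classify as canceled when the predicted probability exceeds 0.5
  mean((p_hat > 0.5) != valid$Cancel)
})
mean_err <- mean(err_rate)
```

Using `predict()` with `newdata` keeps factor coding and the intercept consistent between the training and validation sets without building design matrices by hand.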

AUC

Performing an Area Under the Curve analysis yields a value of 0.8687, which indicates very good discrimination between shows that were canceled and shows that were renewed.
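The AUC can be read as the probability that a randomly chosen canceled show receives a higher fitted probability than a randomly chosen renewed show. A minimal sketch of that calculation from ranks is shown below; the function name `auc_rank` is hypothetical and the inputs are toy values, not the paper's fitted probabilities.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) formulation
auc_rank <- function(y, p) {
  r  <- rank(p)            # average ranks, so ties count as one half
  n1 <- sum(y == 1)        # number of events (canceled)
  n0 <- sum(y == 0)        # number of non-events (renewed)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: three of the four (event, non-event) pairs are ordered correctly
toy_auc <- auc_rank(y = c(0, 1, 0, 1), p = c(0.1, 0.2, 0.3, 0.4))
```

With a fitted model, the same function could be applied to the observed outcomes and the model's fitted values.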

Conclusions

The final model seems to describe the TV cancelations for this data set well, but the model may not be generalizable, because there may be groupings to the data points that weren’t taken into account. However, the favorable AUC and 5-fold validation are positive signs that it may still discriminate well in outside data sets.

Potential issues with grouping in the model arise in that it may matter what other shows are in competition in any given year, as it’s unclear if shows are fighting for a fixed share of viewers in a year. It seems possible that shows have the potential to increase viewership overall and fight for ratings independently of other shows, but more likely ratings depend on other shows in the same year, and shows are fighting for a ‘slice of the pie’ of viewers.

Additionally, I would’ve liked to add a variable for the genre of a show, since that would capture similarities between types of shows in the way the categorical Network variable does for types of channels. The Scripted variable starts to do this, but finer classifications like ‘drama’ and ‘comedy’ may carry their own expected viewership levels, below which a show is canceled. Finally, a show’s air time seems like an interesting variable to record, as it would help normalize ratings into a “good rating for that time” measure, so that shows airing at times traditionally associated with lower viewership are evaluated properly.

Overall, the model fit is very successful, and has applications that could prove useful to the general television production community.

Applications

The signs and magnitudes of the television cancelation model’s coefficients support some interesting conclusions.

1. Scripted shows are more likely to be canceled than unscripted shows with the same ratings and network, as shown by the positive coefficient on Script (2.64558). This could be related to the fact that more scripted shows air in a given year, but the result remains: scripted shows are more likely to be canceled.

2. Previous year demographic ratings have the largest estimated ratings effect on cancelation (PrevDemo -2.64102), followed by current year demographic ratings (Demo -2.22992), overall viewers (Viewers -0.77149), and previous year overall viewers (PrevViewers 0.02498). This is interesting, as it points to executives caring more about attracting the 18-49 demographic that advertisers often target than about overall viewership. This could have broader implications for television, and provide reasoning for the common criticism of television that ‘minorities are underrepresented’ by showing that studios have to appeal to the majority voices in the 18-49 demographic, as dictated by advertisers.

3. It matters what network a show airs on. Some shows may be able to avoid cancelation by airing on a specific network, as it appears each network has its own ratings standards for cancelation, shown by the significance of the Network categorical variable.
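The coefficients above are on the log-odds scale, so exponentiating them gives adjusted odds ratios. The sketch below does this for the non-Network coefficients, using estimates and standard errors copied from the preliminary final model table; the Wald-type 95% intervals are an added illustration and were not reported in the analysis.

```r
# Estimates and standard errors copied from the preliminary final model table
coefs   <- c(Demo = -2.230, PrevDemo = -2.641, Viewers = -0.771,
             PrevViewers = 0.025, Script = 2.646)
std_err <- c(Demo = 1.112, PrevDemo = 1.564, Viewers = 0.302,
             PrevViewers = 0.347, Script = 0.836)

# Odds ratio for a one-unit increase in each covariate, others held fixed
odds_ratio <- exp(coefs)

# Wald 95% confidence limits, computed on the log-odds scale and then exponentiated
z <- qnorm(0.975)
ci_lower <- exp(coefs - z * std_err)
ci_upper <- exp(coefs + z * std_err)

or_table <- round(cbind(odds_ratio, ci_lower, ci_upper), 3)
```

For example, the Script coefficient of 2.646 corresponds to an odds ratio of about 14, meaning the estimated odds of cancelation for a scripted show are about 14 times those of an unscripted show with the same ratings and network.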

Appendix

Load Data
data1 <- read.delim(file.choose(),header=T)
attach(data1)

Final Model
model <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network),family=binomial)
summary(model)

Step 1:
mod.demo <- glm(Cancel~Demo,family=binomial)
mod.prevdemo <- glm(Cancel~PrevDemo,family=binomial)
mod.viewers <- glm(Cancel~Viewers,family=binomial)
mod.prevviewers <- glm(Cancel~PrevViewers,family=binomial)
mod.script <- glm(Cancel~Script,family=binomial)
mod.broadcast <- glm(Cancel~Broadcast,family=binomial)
mod.network <- glm(Cancel~as.factor(Network),family=binomial)
#Examine p-values
round(coef(summary(mod.demo)),3)
round(coef(summary(mod.prevdemo)),3)
round(coef(summary(mod.viewers)),3)
round(coef(summary(mod.prevviewers)),3)
round(coef(summary(mod.script)),3)
round(coef(summary(mod.broadcast)),3)
round(coef(summary(mod.network)),3)
#G statistics: null deviance minus residual deviance
G.demo <- mod.demo$null.deviance-mod.demo$deviance
G.prevdemo <- mod.prevdemo$null.deviance-mod.prevdemo$deviance
G.viewers <- mod.viewers$null.deviance-mod.viewers$deviance
G.prevviewers <- mod.prevviewers$null.deviance-mod.prevviewers$deviance
G.script <- mod.script$null.deviance-mod.script$deviance
G.broadcast <- mod.broadcast$null.deviance-mod.broadcast$deviance
G.network <- mod.network$null.deviance-mod.network$deviance

Step 2:
mod.1.reduce <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script, family=binomial)
summary(mod.1.reduce)

Step 3:
#Model without the previous-year covariates
mod.1.reduce <- glm(Cancel~Demo+Viewers+Script+as.factor(Network), family=binomial)
#Same model with PrevDemo added back
mod.1.test <- glm(Cancel~Demo+PrevDemo+Viewers+Script+as.factor(Network), family=binomial)

#Coefficients (without the intercept) from the models with and without PrevDemo
betas.withPrevDemo <- mod.1.test$coefficients[-1]
betas.wo.PrevDemo <- mod.1.reduce$coefficients[-1]
#Percent change in each remaining coefficient when PrevDemo is removed, matched by name
common <- names(betas.wo.PrevDemo)
100*(betas.wo.PrevDemo-betas.withPrevDemo[common])/betas.withPrevDemo[common]

Step 5:
##taken from http://thestatsgeek.com/2014/09/13/checking-functional-form-in-logistic-regression-using-loess/
logitloess <- function(x, y, s) {
  logit <- function(pr) {
    log(pr/(1-pr))
  }
  #Default loess span is 0.7 unless s is supplied
  if (missing(s)) {
    locspan <- 0.7
  } else {
    locspan <- s
  }
  loessfit <- predict(loess(y~x,span=locspan))
  #Keep smoothed probabilities strictly inside (0,1) before taking the logit
  pi <- pmax(pmin(loessfit,0.9999),0.0001)
  logitfitted <- logit(pi)
  plot(x, logitfitted, ylab="logit")
}
logitloess(Demo,Cancel,0.8)
logitloess(PrevDemo,Cancel,0.8)
logitloess(Viewers,Cancel,0.8)
logitloess(PrevViewers,Cancel,0.8)

Linear Splines
#Failed method – not used in model
#my.4splines is a helper function that is not shown in this appendix
#Create knots for both demo and viewers
knotsdem <- c(.5,2.2,3.5)
knotsview <- c(.2,5.5,13)
Demo.splines <- my.4splines(Demo,knotsdem)
PrevDemo.splines <- my.4splines(PrevDemo,knotsdem)
Viewers.splines <- my.4splines(Viewers,knotsview)
PrevViewers.splines <- my.4splines(PrevViewers,knotsview)
mod.linearsplines <- glm(Cancel~Demo.splines+PrevDemo.splines+Viewers.splines+PrevViewers.splines+Script+as.factor(Network),family=binomial)

Fractional Polynomials
library(mfp)
mod <- mfp(Cancel~fp(Demo)+fp(PrevDemo)+fp(Viewers)+fp(PrevViewers)+Script+as.factor(Network), family=binomial, data=data1, verbose=T)
#Assumed definitions of the transformed terms, inferred from their names
DemoSQRT <- sqrt(Demo)
ViewersSQ <- Viewers^2
mod <- glm(Cancel~Demo+DemoSQRT+PrevDemo+Viewers+ViewersSQ+PrevViewers+Script+as.factor(Network), family=binomial)

Step 6:
#Fit models for all possible pairwise interactions
mod.int01 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+Demo*PrevDemo, family=binomial)
mod.int02 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+Demo*Viewers, family=binomial)
mod.int03 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+Demo*PrevViewers, family=binomial)
mod.int04 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+Demo*Script, family=binomial)
mod.int05 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+Demo*as.factor(Network), family=binomial)
mod.int06 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+PrevDemo*Viewers, family=binomial)
mod.int07 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+PrevDemo*PrevViewers, family=binomial)
mod.int08 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+PrevDemo*Script, family=binomial)
mod.int09 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+PrevDemo*as.factor(Network), family=binomial)
mod.int10 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+Viewers*PrevViewers, family=binomial)
mod.int11 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+Viewers*Script, family=binomial)
mod.int12 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+Viewers*as.factor(Network), family=binomial)
mod.int13 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+PrevViewers*Script, family=binomial)
mod.int14 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+PrevViewers*as.factor(Network), family=binomial)
mod.int15 <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network)+Script*as.factor(Network), family=binomial)
#Wald p-value of the first interaction coefficient (row 16; rows 1-15 are the intercept and main effects)
#For interactions with Network, which add several rows, this reads only the first of them
coef(summary(mod.int01))[16,4]
coef(summary(mod.int02))[16,4]
coef(summary(mod.int03))[16,4]
coef(summary(mod.int04))[16,4]
coef(summary(mod.int05))[16,4]
coef(summary(mod.int06))[16,4]
coef(summary(mod.int07))[16,4]
coef(summary(mod.int08))[16,4]
coef(summary(mod.int09))[16,4]
coef(summary(mod.int10))[16,4]
coef(summary(mod.int11))[16,4]

coef(summary(mod.int12))[16,4]
coef(summary(mod.int13))[16,4]
coef(summary(mod.int14))[16,4]
coef(summary(mod.int15))[16,4]

Diagnostics

Hosmer-Lemeshow
#Perform Hosmer-Lemeshow (HL) test via R package ResourceSelection
require(ResourceSelection)
mod <- glm(Cancel~Demo+PrevDemo+Viewers+PrevViewers+Script+as.factor(Network), family=binomial)
hoslem.test(Cancel,mod$fitted.values,g=10)

Cook’s distances
#Design matrix of the fitted model: intercept, covariates, and Network indicator columns
X <- model.matrix(mod)
p.hats <- mod$fitted.values
V <- diag(p.hats*(1-p.hats))
#Leverages (diagonal of the hat matrix)
hs <- diag(sqrt(V)%*%X%*%solve(t(X)%*%V%*%X)%*%t(X)%*%sqrt(V))
#Pearson residuals
rs <- (Cancel-p.hats)/sqrt(p.hats*(1-p.hats))
delta.chis <- rs^2/(1-hs)
#obtain delta betas (cook's distances)
delta.beta <- rs^2*hs/(1-hs)^2
##obtain delta deviances
#first obtain deviance residuals
ds <- resid(mod)
delta.D <- ds^2/(1-hs)
#Examine observations with large values of diagnostics
#Observation 115 is reported as the maximum in the Diagnostics section
which.max(delta.beta)
delta.beta[115]
#Refit without observation 115 and compute the percent change in each coefficient
mod2 <- glm(Cancel[-115]~Demo[-115]+PrevDemo[-115]+Viewers[-115]+PrevViewers[-115]+Script[-115]+as.factor(Network)[-115], family=binomial)
100*(mod2$coefficients-mod$coefficients)/mod$coefficients

K-fold cross validation
##Create folds from original data
#k=5 folds, i.e., cut X and y into fifths
#Columns 1:6 of X are the intercept and the five non-Network covariates
folds <- list(1:56, 57:112, 113:168, 169:225, 226:282)
###Use each training set to fit a regression model and each fold as a validation set, recording the error rate each time
err.rates <- sapply(folds, function(idx) {
  #Validation set: the current fold
  X.f <- as.matrix(X[idx,1:6])
  y.f <- Cancel[idx]
  #Training set: all other folds combined
  X.t <- as.matrix(X[-idx,1:6])
  y.t <- Cancel[-idx]
  #X.t already contains the intercept column, so glm's own intercept is suppressed
  mod.t <- glm(y.t~X.t-1,family=binomial)
  #Compute fitted values for validation set data using coefficients from training model
  pi.f <- plogis(X.f%*%mod.t$coefficients)
  yhat.f <- round(pi.f)
  #Compute error rate, i.e., disagreement between predictions and actual y values in validation set
  length(which(y.f != yhat.f))/length(y.f)
})
err.rates
#compute mean error rate over five folds
mean(err.rates)
