  • Statistics 622 Fall 2011

    Assignment #5 Yes, the last one!

    Use the daily_returns_2007.jmp data. This data file includes daily returns on the value-weighted stock market index (VWRETD) for 2007 (from CRSP via WRDS). It is easy to build a model that suffers from over-fitting. Let y_t stand for the return on the value-weighted index in time period t (as in a column of the data file). Twelve columns in this file, named Lag 01 through Lag 12, are lags of y_t. The lag of a variable collected over time identifies preceding observations. For example, the 5th lag of y_t is the value 5 rows above, lag(y_t, 5) = y_{t-5}. Lags can be very useful for prediction. For example, one might try to predict tomorrow's return y_t based on today's y_{t-1} by regressing y_t on y_{t-1}. JMP has a formula that defines lags. For these questions, build a model using August through October as the estimation sample (65 days, rows 146-210) to predict the value-weighted index in November and December (41 days, rows 211-251).
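The lag definition can also be sketched outside JMP. The following is a minimal Python illustration on a made-up series (the numbers are not from daily_returns_2007.jmp), and the helper name `lag` is hypothetical. It mirrors lag(y_t, k) = y_{t-k}, leaving the first k rows missing because nothing sits k rows above them.

```python
from typing import List, Optional

def lag(series: List[float], k: int) -> List[Optional[float]]:
    """Return the k-th lag: row t holds the value from k rows above (y_{t-k})."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Rows with fewer than k predecessors have no lagged value.
    return [series[t - k] if t >= k else None for t in range(len(series))]

# Made-up returns, for illustration only.
y = [0.5, -0.2, 0.1, 0.3, -0.4, 0.2, 0.0]

lag1 = lag(y, 1)
lag5 = lag(y, 5)

# Pairs (y_t, y_{t-1}) that a regression of y_t on y_{t-1} would use.
pairs = [(y[t], lag1[t]) for t in range(len(y)) if lag1[t] is not None]
print(lag5)
print(pairs)
```

Rows whose lag is missing drop out of the regression, which is why the estimation sample in this assignment starts well after the first rows of the file.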

    1. Fit a multiple regression with response VWRETD on its 12 lags in columns Lag 01, …, Lag 12. Don't use stepwise regression; just fit a model with all 12 lags as explanatory variables. Print the summary of your model (include the R^2, ANOVA table, and estimated coefficients).

    Does this regression promise to predict future returns more accurately than using the historical mean of the observed returns? Does any predictor claim to be statistically significant, in the sense of explaining more than a random proportion of the variation in the returns at the usual 0.05 level?
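One way to frame the comparison with the historical mean is the overall F-statistic, which tests the fitted regression against the mean-only model, together with the adjusted R^2. As a sketch, assuming all n = 65 estimation days are used with k = 12 lags, and with R^2 taken from the JMP summary:

```latex
F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)} = \frac{R^2 / 12}{(1 - R^2)/52},
\qquad
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} = 1 - (1 - R^2)\,\frac{64}{52}.
```

Under the null hypothesis that no lag helps, F is referred to an F distribution with 12 and 52 degrees of freedom. The ANOVA table in the JMP output reports this ratio and its p-value.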

    2. Use stepwise regression to build a second regression model, one that allows interactions between lags and quadratic effects [1]. Use the AIC rule and select the "no rules" option. Build the model, summarize its results, and save its prediction formula.

    Does this regression promise to predict future returns more accurately than using the historical mean of the observed returns? Does any predictor claim to be statistically significant, in the sense of explaining more than a random proportion of the variation in the returns at the usual 0.05 level?
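To get a sense of how large the stepwise search is, the candidate terms can be counted directly. The sketch below assumes the pool is exactly the 12 lags, their 12 squares, and all pairwise interactions. The rule 0.05 / m for a Bonferroni-style p-to-enter is a common convention and is an assumption here, not a setting quoted from the assignment.

```python
from itertools import combinations

lags = [f"Lag {i:02d}" for i in range(1, 13)]            # Lag 01 ... Lag 12

linear = list(lags)
squares = [f"{name}*{name}" for name in lags]
interactions = [f"{a}*{b}" for a, b in combinations(lags, 2)]

candidates = linear + squares + interactions
m = len(candidates)                                       # 12 + 12 + 66

alpha = 0.05
bonferroni_p_to_enter = alpha / m                         # assumed rule: alpha / m

print(m)                                  # 90
print(round(bonferroni_p_to_enter, 6))    # 0.000556
```

With 90 candidate terms and only 65 estimation days, the search has more candidates than observations, which is exactly the setting in which over-fitting is easy.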

    3. Build a third regression using forward stepwise, but with the p-to-enter set to the Bonferroni threshold appropriate for the forward selection. Again, save the model's prediction formula (for Q5). Show a summary of the resulting model. Which features are statistically significant predictors of future returns?

    4. According to believers in an efficient market, what model should be found? Was it? [2]

    5. Use all 3 models to predict returns in November and December.

    (a) Which model is most accurate? Be sure to indicate how you define "accuracy."

    (b) Which model is the most "honest"? That is, which of the three models provides an "honest" estimate (at the time you can fit it) of its actual forecasting performance?
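Accuracy in Q5 has to be defined before the models can be compared. One common choice, shown here only as an illustration, is the root mean squared error (RMSE) of the saved predictions over the test days, with the estimation-sample mean as the benchmark. The numbers below are made up and the function name `rmse` is hypothetical.

```python
import math
from typing import Sequence

def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared prediction error over paired observations."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up illustration values, NOT taken from daily_returns_2007.jmp.
train_returns = [0.4, -0.6, 0.1, 0.9, -0.3]
test_returns = [0.2, -0.5, 0.7, -0.1]
model_predictions = [0.6, 0.3, -0.4, 0.5]   # stand-in for a saved prediction formula

# Benchmark: predict every test day with the estimation-sample mean.
train_mean = sum(train_returns) / len(train_returns)
benchmark = [train_mean] * len(test_returns)

print(rmse(test_returns, model_predictions))
print(rmse(test_returns, benchmark))
```

Comparing each model's test-period RMSE with the error it reported on the estimation sample is one way to judge how "honest" the in-sample summary was.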

    [1] To define the interactions, select the 12 lagged variables, click the "Macro" button, and choose the response surface option. That will add the 12 lags as well as the interactions and squares of the lags to the list of explanatory variables to be considered by the stepwise search. An interaction of this type implies "synergies" among the changes from prior values, such as "lag 2 is useful only when lag 3 is large."

    [2] Those interested in predicting markets might want to try fitting the stepwise model using the full year of data for an interesting surprise.

    The data set for Q6-Q10 (solvency.jmp) gives financial ratios for 66 firms, of which half declared bankruptcy within 2 years of the shown data (Solvent = "no"). These data were originally used to develop the Altman z-score that is often used to rate the default risk of corporate bonds. The problem is to predict, using logistic regression, which companies will go bankrupt given these financial ratios. The column names for the predictors are self-explanatory; all are common financial ratios.

    6. Estimate the one-predictor logistic regression of Solvent on Working Capital/Assets, the ratio of capital available to total assets. Show a summary of your model and briefly interpret the estimated coefficients.

    7. Does Working Capital/Assets have a statistically significant effect upon solvency?

    8. Using the model fit in Q6, compare the likelihood of bankruptcy for two firms, one with ratio Working Capital/Assets = 30 and another with Working Capital/Assets = 40. (a) What is the difference between the probabilities of default? (b) Would you get a different answer if the values were 20 and 30? (No need to do it, just explain.)

    9. Show and interpret the ROC curve of this logistic regression. [3] What does the ROC curve reveal about the ability of this model to classify companies? (At least, if we believe the data are representative.)

    10. Suppose we predict a firm to go bankrupt if the estimated probability of solvency from the model in Q6 is less than [value missing]. Give the sensitivity and specificity of this classification rule. Locate the performance of this classification rule on the ROC curve shown in Q9.
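For the comparison of two firms, it helps to recall that a logistic regression is linear on the log-odds scale, so a fixed 10-point change in the ratio moves the probability by an amount that depends on where it starts. The coefficients below are hypothetical placeholders, not estimates from solvency.jmp, and the names `logistic` and `prob` are illustrative.

```python
import math

def logistic(z: float) -> float:
    """Inverse logit: map log-odds z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

# Hypothetical intercept and slope on the log-odds scale (NOT fitted values).
b0, b1 = -1.0, 0.05

def prob(x: float) -> float:
    """Estimated probability of the modeled outcome at ratio x."""
    return logistic(b0 + b1 * x)

diff_30_40 = prob(40) - prob(30)
diff_20_30 = prob(30) - prob(20)

# The log-odds change equals 10 * b1 in both comparisons,
# yet the two probability differences are not equal.
print(logit(prob(40)) - logit(prob(30)), logit(prob(30)) - logit(prob(20)))
print(diff_30_40, diff_20_30)
```

The same reasoning applies whichever level of Solvent the software treats as the modeled outcome; only the sign of the slope changes.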

    Bonus. Does this logistic regression appear calibrated? Show a plot that indicates the degree of calibration. (Hint: Use the same plot that we use to check the calibration of a simple linear regression to check the calibration of the model. If you make the response a numerical 0/1 dummy variable rather than categorical, you will find that it works out very easily.)

    [3] Use the option obtained by clicking on the red triangle in the output summary of the logistic regression to view the ROC curve. If you fit the one-predictor model using the Fit Model tool, you can also save the prediction formulas that determine the ROC curve.
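A saved prediction formula gives each firm an estimated probability, and every cutoff on that probability yields one (1 - specificity, sensitivity) point on the ROC curve. The sketch below uses invented probabilities and labels and treats bankruptcy as the "positive" class. That choice is an assumption, and the two rates swap roles if solvency is taken as positive. The function name `sens_spec` is hypothetical.

```python
from typing import List, Tuple

def sens_spec(p_solvent: List[float], bankrupt: List[bool], cutoff: float) -> Tuple[float, float]:
    """Predict bankruptcy when estimated P(solvent) < cutoff.

    Returns (sensitivity, specificity) with bankruptcy as the positive class.
    """
    n_pos = sum(bankrupt)
    n_neg = len(bankrupt) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one bankrupt and one solvent firm")
    predicted = [p < cutoff for p in p_solvent]
    true_pos = sum(1 for yhat, y in zip(predicted, bankrupt) if yhat and y)
    true_neg = sum(1 for yhat, y in zip(predicted, bankrupt) if not yhat and not y)
    return true_pos / n_pos, true_neg / n_neg

# Invented values for illustration, NOT from solvency.jmp.
p_solvent = [0.10, 0.30, 0.45, 0.60, 0.55, 0.70, 0.85, 0.95]
bankrupt = [True, True, True, True, False, False, False, False]

# Each cutoff gives one point (1 - specificity, sensitivity) on the ROC curve.
for cutoff in (0.25, 0.50, 0.75):
    sens, spec = sens_spec(p_solvent, bankrupt, cutoff)
    print(cutoff, sens, spec, (1 - spec, sens))
```

Sweeping the cutoff from 0 to 1 traces the whole curve, so a single classification rule corresponds to one point on it.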