Big Data Project - Final version

Page 2: Big Data Project - Final version

Machine Learning Analysis of Micro & Macroeconomic Variables

Alakar Srinivasan, Pinank Shah, Tjisana Kerr, Mihir Sanghavi, Vinayak Kishanchandani

Page 3: Big Data Project - Final version

Background & Objectives

Background:

The S&P Global 500 is a popular index consisting of 500 of the largest listed global companies by market capitalization

Stock index prediction is an old but still challenging problem to tackle with new techniques

Another challenging problem is devising portfolio optimization strategies that outperform the index

Key Challenges: Stock markets can be extremely volatile and reactive to new events and information

The Efficient Market Hypothesis suggests that stock markets cannot be predicted or consistently outperformed

Objective:

Compare the performance of various statistical and machine-learning prediction techniques in estimating future daily returns of the S&P index; stock market prediction is enticing and remains one of the ultimate challenges

[Chart] S&P Global 100 Companies: Market Capitalization ($ Billion), 2009-2015: 8.4, 12, 13, 12.9, 13.6, 15, 16.2

Page 4: Big Data Project - Final version

List of Prediction Methods For Comparison

1. Linear Regression

2. Lasso Regression

3. Holt-Winters Filtering

4. K Nearest Neighbors Algorithm

5. Support Vector Regression

Page 5: Big Data Project - Final version

1. Linear Regression

About Linear Regression:

• A linear model that estimates the relationship between a dependent variable (daily returns) and a set of explanatory variables, in this case sector-specific returns, exchange rates, commodity prices, etc.

• One of the first regression techniques to be studied extensively

• Linear models are fitted using the Ordinary Least Squares (OLS) approach

• Pros: Simple to execute, easy to implement, easy to interpret, and fast to run

• Cons: A linear model cannot capture the non-linearity that exists in the context of stock market prediction

• Results are unreliable when very few data points are available, and the model can overfit when many explanatory variables are used relative to the data

Observation / Result:

The model included all variables except one (enterprise index), whose p-value was greater than 0.05; that variable was removed.

With all other factors retained, model performance was RMSE: 47.44

[Chart] Actual vs Predicted Values
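As a rough illustration of the OLS fit described above, a minimal sketch on synthetic data (the features stand in for the project's explanatory variables; all names and values here are invented, not the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the explanatory variables (sector returns,
# exchange rate, commodity prices); purely illustrative, not real data.
n = 250
X = rng.normal(size=(n, 3))
true_beta = np.array([0.5, -0.2, 0.1])
y = X @ true_beta + rng.normal(scale=0.05, size=n)  # "daily returns"

# OLS: prepend an intercept column and solve the least-squares problem.
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

pred = Xd @ beta
rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
```

In practice one would also report coefficient p-values (e.g. via statsmodels) to drop insignificant factors, as done for the enterprise index above.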

Page 6: Big Data Project - Final version

2. Lasso Regression

About Lasso Regression:

• Lasso stands for Least Absolute Shrinkage and Selection Operator and belongs to the class of generalized linear models

• Lasso regression performs both variable selection and regularization in order to improve prediction accuracy

• Pros: Lasso models are typically better than plain linear regression because variable selection and regularization help reduce overfitting

• Lasso improves on ridge regression in that it performs both parameter shrinkage and selection, while ridge performs only regularization

• Cons: Although Lasso can be extended to generalized linear models, its ability to capture non-linear relationships is constrained

Result: Using LASSO, three additional factors, (1) US 10-year bond, (2) Health, and (3) Telecom, turned out not to be significant for the model and were eliminated due to high p-values

Thanks to this further variable selection and coefficient shrinkage, the prediction performance of the model improved to RMSE: 43.34
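A minimal sketch of Lasso's selection behavior, assuming scikit-learn is available (the data and factor count are synthetic; the three zeroed factors merely mimic the insignificant ones dropped above):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 5))
# Only the first two synthetic factors drive the target; the other three
# play the role of the insignificant factors eliminated above.
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

model = Lasso(alpha=0.05).fit(X, y)
# The L1 penalty shrinks irrelevant coefficients exactly to zero,
# so the fit performs variable selection alongside regularization.
selected = np.flatnonzero(model.coef_)
```

Larger `alpha` values shrink more aggressively and drop more factors; ridge regression, by contrast, would shrink all five coefficients but zero none.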

Page 7: Big Data Project - Final version

3. Holt-Winters Filtering

About Holt-Winters:

• Holt-Winters filtering performs exponential smoothing, a simple forecasting method for time series data

• The method can apply up to three low-pass filters recursively with exponential windows, defined by the parameters alpha, beta, and gamma

• Exponential smoothing differs from a simple moving average in that it assigns exponentially decreasing weights to historical data instead of equal weights within the time window

• Pros: Better than SMA in that it fits the parameters controlling the exponentially decreasing weights, improving prediction performance

• Cons: Although simpler to execute than more complex time series forecasting techniques such as ARIMA, its prediction accuracy is lower in comparison

Observation / Result: Since the daily-returns data is stationary and non-seasonal, only the alpha parameter was used to perform exponential smoothing.

An alpha of 0.191 was found to minimize the SSE, although in prediction performance exponential smoothing ranked well below the other methods
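The alpha-only smoothing and SSE-minimizing search can be sketched as follows (the series here is synthetic; the slide's alpha = 0.191 comes from the actual returns data):

```python
import numpy as np

def exp_smooth(series, alpha):
    """Simple exponential smoothing: one-step-ahead forecasts where
    level_t = alpha * y_t + (1 - alpha) * level_{t-1}. Only the alpha
    filter is used; beta (trend) and gamma (seasonality) are omitted,
    matching the stationary, non-seasonal returns case above."""
    level = series[0]
    fitted = np.empty_like(series, dtype=float)
    fitted[0] = level
    for t in range(1, len(series)):
        fitted[t] = level                      # forecast is the last level
        level = alpha * series[t] + (1 - alpha) * level
    return fitted

# Grid-search alpha by minimizing the sum of squared errors (SSE).
rng = np.random.default_rng(2)
series = rng.normal(size=500)                  # synthetic stationary series
alphas = np.linspace(0.01, 0.99, 99)
sse = [np.sum((series - exp_smooth(series, a)) ** 2) for a in alphas]
best_alpha = float(alphas[int(np.argmin(sse))])
```

For a noisy stationary series the search favors a small alpha, which averages over more history; a near-1 alpha would simply echo the previous observation.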

Page 8: Big Data Project - Final version

4. K Nearest Neighbors

About K-NN:

• The KNN algorithm classifies objects based on the closest training examples in the feature space

• KNN is a fundamental and simple classification technique, useful when there is little or no prior knowledge about the distribution of the data

• In addition to classification, KNN can also be used for regression and prediction: the average response of the k nearest neighbors is taken as the predicted value

• Pros: Computationally efficient, and can provide high accuracy if there is good correlation between historical and future events

• Pros: A lazy, instance-based learner, so there is no explicit model-training step

• Cons: Input features and dimensions need to be scaled appropriately for KNN to perform well

Observation / Result: KNN regression with k=5 gives the best RMSE of 15.5498 on the test data. To predict tomorrow's close of the S&P 500, we find the k nearest neighbors of today's observation among all prior observations (from t=1) and compute the mean of their next-day closing prices

[Charts] (1) S&P 500 actual vs. predicted closing prices (range 1200-2200); (2) RMSE vs. k for k = 0-40 (RMSE range 15-25, minimized near k = 5)
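The nearest-neighbor averaging step described above can be sketched as follows (toy data, not the S&P series):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Predict by averaging the targets of the k nearest training
    points under Euclidean distance, as in KNN regression."""
    X = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(np.asarray(y_train, dtype=float)[nearest]))

# Toy example: 1-D features standing in for "today's observation",
# targets standing in for "tomorrow's close".
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 10.0, 20.0, 30.0]
pred = knn_predict(X_train, y_train, np.array([0.1]), k=2)  # -> 5.0
```

The query's two nearest neighbors are the points at 0.0 and 1.0, so the prediction is the mean of their targets, 5.0; sweeping k and scoring RMSE on held-out data is how the k=5 optimum above would be found.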

Page 9: Big Data Project - Final version

5. Support Vector Regression

Observation / Result: As observed from the charts, SVR predicts the daily returns better than the other techniques, such as exponential smoothing and linear regression.

SVR took about 2 hours to train, compared to a few seconds for most other models; performance was RMSE: 11.25

About SVR:

The SVM technique can also be used as a regression method, maintaining all the main features that characterize the algorithm (maximal margin)

For regression, a margin of tolerance (epsilon) is set around the predictions: errors within epsilon incur no penalty

This makes the algorithm more complicated than SVM classification, which should be taken into consideration

Pros: Can function well with high number of variables, high number of observations, can model non linearity better than simple regression

Cons: A complicated algorithm whose output makes it difficult to interpret results and identify drivers

Also very time-consuming during model training compared to the other models
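A minimal SVR sketch, assuming scikit-learn is available (synthetic non-linear data, not the returns series; the kernel and hyperparameters are illustrative choices, not the project's):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
n = 400
X = rng.uniform(-1, 1, size=(n, 2))
# A non-linear target that a plain linear model cannot capture.
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.05, size=n)

# epsilon is the margin of tolerance: errors inside it incur no loss.
# The RBF kernel is what lets SVR model the non-linear structure.
model = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, y)
rmse = float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))
```

Kernel SVR training scales poorly with the number of observations, which is consistent with the 2-hour training time reported above.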

Page 10: Big Data Project - Final version

Comparison of Techniques / Results

Parameter             Linear Reg.   Lasso Reg.   Holt-Winters   K-NN     SVR
RMSE                  47.44         43.34        71.65          15.5     11.25
Processing Time       Very Low      Low          Low            Medium   High
Actionability         High          High         Low            Low      Medium
Implementation Ease   High          Medium       High           Medium   Low

Page 11: Big Data Project - Final version

Conclusion and Further Research

Key takeaways from stock market prediction analysis:

Linear Regression has a considerably high RMSE, possibly because the stock market does not follow a linear model.

Best Performing Model: Support Vector Regression, as is clearly evident from the RMSE values shown on the previous slide.

SVR, being non-linear, is a better model than all the others used here.

In terms of processing time, however, Support Vector Regression took the longest to run, as shown on the previous slide.

Page 12: Big Data Project - Final version

List of Time Series Models for Equity Factors

1. Linear Approximation

2. I.I.D. Analysis

3. Technical Indicators

4. Neural Networks

Page 13: Big Data Project - Final version

1. Linear Approximation

About Linear Approximations:

• Linear least squares is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the differences between the observed responses in some arbitrary dataset and the responses predicted by the linear approximation of the data.

• Pros: Simple model.

• Cons: Leaves patches of positive and negative residuals when the data is curved.

• In polynomial regression, the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial. This fits a non-linear relationship between x and the corresponding conditional mean of y.

• Pros: Polynomials fit curvilinear data better.

• Cons: There is a likelihood of overfitting when using polynomial regression.

Observations - R2:

Equity   Linear     Poly
MS       0.000575   0.125526
JPM      0.132070   0.655651
GS       0.092261   0.279183
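A small sketch of the linear-vs-polynomial comparison on curved synthetic data (illustrative only; the R2 values above come from the actual MS/JPM/GS series, and the cubic degree is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 200)
y = 2 * x ** 2 - x + rng.normal(scale=0.05, size=x.size)  # curved data

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

lin_fit = np.polyval(np.polyfit(x, y, 1), x)   # linear approximation
poly_fit = np.polyval(np.polyfit(x, y, 3), x)  # cubic polynomial fit

r2_lin = r_squared(y, lin_fit)
r2_poly = r_squared(y, poly_fit)
```

On curved data the linear fit leaves the runs of same-sign residuals noted above, so its R2 lags the polynomial's; the overfitting risk appears once the degree grows relative to the data.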

Page 14: Big Data Project - Final version

2. I.I.D. Analysis

About I.I.D. Analysis:

A sequence or other collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent

Observation / Result: Notice that the scatter plot is symmetrical with respect to the reference axes and resembles a circular cloud, which implies that all the terms are identically distributed. Thus, returns for both sets of data are an invariant; in particular, all the terms in the series are independent of each other.
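The circular-cloud check has a simple numeric counterpart; a sketch on simulated i.i.d. returns (the project applied this to the actual return series):

```python
import numpy as np

rng = np.random.default_rng(5)
returns = rng.normal(scale=0.01, size=1000)  # stand-in for daily returns

# Independence (informal check): if returns are i.i.d., plotting r_t
# against r_{t-1} gives a structureless circular cloud, so the lag-1
# autocorrelation should be near zero.
lag1_corr = float(np.corrcoef(returns[:-1], returns[1:])[0, 1])

# Identical distribution (informal check): the two halves of the
# sample should have nearly the same mean.
mean_gap = float(abs(returns[:500].mean() - returns[500:].mean()))
```

A markedly elliptical or tilted cloud would instead show serial dependence, and a drifting mean between the halves would argue against identical distribution.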

Page 15: Big Data Project - Final version

3. Technical Indicators

About Technical Indicators:

• Momentum: Momentum is the rate of rise/fall of a security's price or volume. The idea of momentum in securities is that their price is more likely to keep moving in the same direction than to change directions.

• Relative Strength Indicator: The relative strength index (RSI) is a technical momentum indicator that compares the magnitude of recent gains to recent losses in an attempt to determine overbought and oversold conditions of an asset.

• Price Rate of Change: The price rate of change (ROC) is a technical indicator that measures the percentage change between the most recent price and the price "n" periods in the past.

• Price Volume Trend: A technical indicator consisting of a cumulative line that adds or subtracts a multiple of the percentage change in share price and the current volume, depending on the price's upward or downward movement.

• Bollinger Bands of 50 Day Moving Average: A Bollinger Band® is a band plotted two standard deviations away from a simple moving average, developed by famous technical trader John Bollinger.
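Two of the indicators above reduce to short formulas; a sketch (the n-period windows and the simple-average RSI form, rather than Wilder's smoothing, are assumptions of this sketch):

```python
import numpy as np

def roc(prices, n):
    """Price rate of change: percent change between the most recent
    prices and the prices n periods earlier."""
    p = np.asarray(prices, dtype=float)
    return (p[n:] - p[:-n]) / p[:-n] * 100.0

def rsi(prices, n=14):
    """Relative strength index over the last n price changes, using
    simple averages of gains and losses."""
    deltas = np.diff(np.asarray(prices, dtype=float))[-n:]
    avg_gain = np.clip(deltas, 0, None).mean()
    avg_loss = -np.clip(deltas, None, 0).mean()
    if avg_loss == 0:
        return 100.0          # no losses in the window: fully overbought
    rs = avg_gain / avg_loss
    return float(100.0 - 100.0 / (1.0 + rs))
```

For example, `roc([100, 110], 1)` yields `[10.0]`, and a window of uninterrupted gains drives the RSI to its overbought ceiling of 100.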

Page 16: Big Data Project - Final version

3. Technical Indicators

Page 17: Big Data Project - Final version

4. Neural Networks

Page 18: Big Data Project - Final version

4. Neural Networks

Page 19: Big Data Project - Final version

4. Neural Networks

Page 20: Big Data Project - Final version

4. Neural Networks

Page 21: Big Data Project - Final version

Sources of Information