October 11th, 2016
Group: 5
Big Data Combine Engineered By BattleFin
The Team
Partha S Satpathy (Team Lead)
Jorge Trevino
Mayuresh Indapurkar
Ratnam Dubey
Problem Statement
Inputs: 244 inputs. These represent sentiment data from several sources, such as newspapers and Twitter.
Outputs: 198 stocks.
We have the change in their values at 5-minute intervals from 9 AM to 1:30 PM.
Our goal: predict the value of the outputs (all 198 of them) at 4 PM.
Project Overview
• Correlation among inputs: first find the inputs which actually drive the change in output.
• Linear regression model: find a relationship between the output and the independent inputs (predictors).
• Predict the inputs at 4 PM using a trend line: we need the inputs first to calculate the output at 4 PM.
• Check if the model is correct: we compare the predicted values with the training data.
• Predict future values: Ta-da! We get the future change in stock values at 4 PM.
Correlation – what is it?
• It is a single number (between -1 and 1) that describes the degree of relationship between two variables.
• As one variable rises or falls, the other variable rises or falls as well.
• Positive correlation: as Variable 1 increases, Variable 2 increases.
• Negative correlation: as Variable 1 increases, Variable 2 decreases.
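A minimal sketch of the two cases in R, using the same cor() function the later slides rely on. The vectors v1, v2 and v3 are made-up illustration data, not values from the project.

```r
# Hypothetical data: v2 rises with v1, v3 falls as v1 rises.
v1 <- c(1, 2, 3, 4, 5)
v2 <- c(2, 4, 6, 8, 10)
v3 <- c(10, 8, 6, 4, 2)

pos <- cor(v1, v2)  # perfectly positive relationship
neg <- cor(v1, v3)  # perfectly negative relationship
print(c(positive = pos, negative = neg))
```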
Correlation – Correlation among Inputs?
[Diagram: two sources, each of which drives the stock price change]
Correlation - Removing Correlation among Inputs
• Large number of input variables (244)
• Discard redundant variables
  – ones which do not affect the output
  – highly correlated variables
Correlation - using R
• cor(<data matrix>)
• Returns a symmetric correlation matrix, so only one triangle needs to be inspected
• Shows the correlation of every input variable with all other input variables
• Discard input variables having a correlation > 0.3 with another input variable
• Approximately 3-5 input variables remain
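The slides do not say exactly how variables were discarded, so the following is only one possible sketch: a greedy pass that keeps a column only if its absolute correlation with every column kept so far is at most 0.3. The data frame inputs, its column names, and the use of the absolute value are assumptions made for illustration.

```r
# Hypothetical inputs: I2 is almost a copy of I1, I4 is almost the negative of I3.
a <- rep(c(1, -1), 50)
b <- rep(c(1, 1, -1, -1), 25)
inputs <- data.frame(
  I1 = a,
  I2 = a + 0.1 * b,
  I3 = b,
  I4 = -b + 0.1 * a
)

cm <- cor(inputs)  # correlation of every input with every other input

# Greedy filter: keep a variable only if it is weakly correlated
# (|r| <= 0.3) with all variables that have already been kept.
keep <- character(0)
for (name in colnames(cm)) {
  if (all(abs(cm[name, keep]) <= 0.3)) {
    keep <- c(keep, name)
  }
}
print(keep)
```

The result of a greedy pass depends on the column order, so a different ordering of the inputs could retain a different subset.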
Linear Regression - Introduction
• Linear regression is a simple and useful tool for predicting a quantitative response.
• Here we assume that there is a linear relationship between the response and the predictor.
Linear Regression Types
• Simple Linear Regression
• Multiple Linear Regression
Simple Linear Regression
• Simple linear regression is a straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X.
• It assumes that there is approximately a linear relationship between X and Y.
• Mathematically, we can write this linear relationship as: $Y \approx \beta_0 + \beta_1 X$
[Figure: Y plotted against X]
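For reference, the standard least-squares estimates of the intercept and slope for n observed pairs $(x_i, y_i)$ can be written as follows. The hats and the sample means $\bar{x}$, $\bar{y}$ are notation introduced here, not taken from the slides.

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```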
Multiple Linear Regression
What if we have more than one predictor (X)?
• We use Multiple Linear Regression.
• It is the same as Simple Linear Regression, but the equation is extended to all the predictors.
• Mathematically speaking: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$
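Using Multiple Linear Regression in the Project
print(head(data.new, 5))
# Regress output O on the retained inputs. Writing O ~ . (rather than
# data.new$O ~ .) keeps the column O itself out of the predictors.
lmO <- lm(O ~ ., data = data.new)
print(coef(lmO))
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$, where $Y$ is O and $X_1$, $X_2$, $X_3$ are I16, I242 and I244.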
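As a self-contained illustration of the fit, the sketch below builds a small stand-in for data.new with made-up values and a known linear relationship, then checks that lm() recovers the coefficients. Only the column names O, I16, I242 and I244 come from the slides; the numbers are hypothetical.

```r
# Hypothetical stand-in for data.new: three input columns and one output column O.
k <- 1:20
data.new <- data.frame(I16 = k, I242 = k^2, I244 = sin(k))
# The output is built from known coefficients so the fit can be checked.
data.new$O <- 1 + 2 * data.new$I16 - 3 * data.new$I242 + 0.5 * data.new$I244

lmO <- lm(O ~ ., data = data.new)  # O against all other columns
betas <- coef(lmO)                 # (Intercept), I16, I242, I244
print(round(betas, 6))
```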
Predicting Future Values at 4 PM - I
Leonard: So, Sheldon,
- You got your equation
- You got β0, β1, ... from the model
- Now, to calculate Y at 4 PM, you need X at 4 PM too
- What is your plan for that?
Sheldon: Do not worry, Leonard. I've got that covered. We have a trend line that is going to give us the X values (inputs) at 4 PM.
Trend line – Introduction
• A line indicating
  – the direction of a process with respect to time
  – the tendency of data with respect to time
• Employed whenever time-dependent data is available.
Trend line – Types
• Linear
• Non-linear
  – Polynomial
  – Exponential
  – Logarithmic
Linear trend line using R
• No consideration for the probabilistic (or stochastic) nature of the process.
• A linear trend line is a linear regression with respect to time.
• Function which can be used in R: lsfit() (“Least Squares Fit”)
  – Returns the slope “m” and the constant “c”
Trend line – how we’ve used it - I
• 244 input variables
• Input variable values known for 310 days
  – 9 am to 1:30 pm (at 5-minute intervals)
  – 55 values per variable per day
  – 55th time interval -> 1:30 pm
  – 85th time interval -> 4 pm
• Estimate the input variable value at 4 pm
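The interval numbers follow from counting 5-minute steps starting at 9:00 as interval 1. A small sketch of that arithmetic is shown here; the helper name interval_index is hypothetical and does not appear in the slides.

```r
# Map a clock time to its 5-minute interval index, with 9:00 as interval 1.
interval_index <- function(hour, minute) {
  (hour - 9) * 12 + minute / 5 + 1
}

idx_1330 <- interval_index(13, 30)  # last observed interval
idx_1600 <- interval_index(16, 0)   # interval to be predicted
print(c(idx_1330, idx_1600))
```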
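Trend line – how we’ve used it - II
# Fit a least-squares line to the input values against the interval index
mc <- lsfit(start:end, day[start:end, col])
print(mc$coefficients)
# Clock-time labels (HH.MM) for the 55 intervals from 9:00 to 13:30
x <- c(seq(9.0, 9.55, .05), seq(10.0, 10.55, .05),
       seq(11.0, 11.55, .05), seq(12.00, 12.55, .05), seq(13.0, 13.30, .05))
# Plot against the interval index so that abline(mc) is drawn on the scale it was fitted on
plot(seq_along(x), day[, col], xaxt = "n", xlab = "Hour", ylab = names(day)[col])
axis(1, at = seq_along(x), labels = x)
abline(mc)
Input at 4 PM: evaluate the fitted line at X = 85.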
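A minimal, self-contained sketch of the extrapolation step, assuming the line is fitted against the interval index 1 to 55. The vector values is made-up data that lies exactly on a line, so the result can be checked by hand.

```r
# Hypothetical input values for one day: 55 observations on an exact line.
idx <- 1:55
values <- 0.5 + 0.02 * idx

mc <- lsfit(idx, values)              # least-squares fit against the interval index
c0 <- unname(mc$coefficients[1])      # constant "c" (intercept)
m  <- unname(mc$coefficients[2])      # slope "m"

x_4pm <- c0 + m * 85                  # extrapolate to the 85th interval (4 PM)
print(x_4pm)
```

Extrapolating 30 intervals beyond the last observation assumes the morning trend continues unchanged through the afternoon.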
Predicting Future Values at 4 PM - II
Leonard: Wow, Sheldon!
- You got your equation
- You got β0, β1, ... from the model
- You got the X values at 4 PM too
- Did you find the Y values at 4 PM?
Sheldon: Of course I did, Leonard. I put the values into the equation and found the future stock value change for 198 stocks for 310 days. Check it out.
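Putting the values into the equation amounts to the intercept plus a weighted sum of the inputs. The sketch below shows that arithmetic with made-up numbers: the coefficients and the 4 PM input values are hypothetical, and only the input names I16, I242 and I244 come from the slides.

```r
# Hypothetical regression coefficients: intercept first, then one per input.
beta <- c(0.5, 2, -1, 0.25)
# Hypothetical trend-line estimates of the three inputs at 4 PM.
x_at_4pm <- c(I16 = 1, I242 = 2, I244 = 4)

# Y = beta0 + beta1 * X1 + beta2 * X2 + beta3 * X3
y_at_4pm <- beta[1] + sum(beta[-1] * x_at_4pm)
print(y_at_4pm)
```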
Testing the model
Leonard: I was wondering whether we could test if our model is estimating correct values.
Sheldon: Good thinking, Leonard. We have 200 days of actual outputs. I am going to compare our predicted values with the actual values for one stock. Here we go.
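The slides compare predicted and actual values graphically. As a complementary sketch, one simple numeric summary is the mean absolute error; this metric is an assumption chosen for illustration, not something the slides state was used, and both vectors below are made-up values.

```r
# Hypothetical actual and predicted changes for one stock over four days.
actual    <- c(0.10, -0.20, 0.05, 0.30)
predicted <- c(0.12, -0.25, 0.00, 0.28)

# Mean absolute error: average size of the prediction error, ignoring sign.
mae <- mean(abs(predicted - actual))
print(mae)
```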
Data Can Be Confusing?
I can interpret the data using graphs…
We use graphs to organize data.
How to choose the type of graph?
Check the purpose of the graph and the type of data to be used.
Type of data: numeric, stock-related data
Selection of graph: line graph
Line graphs are used to track changes over short and long periods of time. When smaller changes exist, line graphs are better to use than bar graphs.
How to plot the graph?
What do these functions do?
• Gather and arrange the data as needed.
• Compare two data sets: the minimum-variance stock and the maximum-variance stock.
• Plot the data.
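The functions shown on the original slide did not survive extraction, so the following is only a sketch of how these three steps could look in R. The data frame outputs, its columns O1 to O3, and their values are hypothetical.

```r
# Gather and arrange: hypothetical predicted changes for three stocks over ten days.
days <- 1:10
outputs <- data.frame(
  O1 = 0.1 * sin(days),
  O2 = 2.0 * sin(days),
  O3 = 0.5 * cos(days)
)

# Compare: find the stocks with the smallest and largest variance.
v <- sapply(outputs, var)
min_stock <- names(which.min(v))
max_stock <- names(which.max(v))

# Plot: draw both stocks as line graphs on the same axes.
plot(days, outputs[[max_stock]], type = "l", col = "red",
     xlab = "Day", ylab = "Change in value")
lines(days, outputs[[min_stock]], col = "blue")
legend("topright", legend = c(max_stock, min_stock),
       col = c("red", "blue"), lty = 1)
```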
Minimum and Maximum Variance Stock
[Figure: two panels, Minimum Variance Stocks and Maximum Variance Stocks]
Results
Successfully predicted the values at 4 PM for 210 days.
After analyzing the stock variance, I would probably choose the stock with the minimum variance!
Minimum Variance Stock or Maximum Variance Stock?
THANK YOU !!
Questions ??