Statistical Analysis on real estate and Asian population By Weicheng (Alex) Guo, Natthaporn (Pear) Loetsakulch, Ihsanullah Shagiwal, Tom Bopp

Math240-Final Project-Final Version 11-18-14 - With corrections




Introduction:

This report applies basic statistics to data we collected for California from 2000 to 2010. The Asian population in California rose steadily over the decade. Real estate prices ended the decade nearly $50 thousand higher than they began. The total number of industrial establishments in California increased by about 50,000. Reported crimes declined by 93,271 between 2000 and 2010. The number of unemployed people rose by approximately 1,430,000 on net (from 833,237 in 2000 to 2,259,942 in 2010). Central tendency is usually described by the mean, median, and mode, and variability by the standard deviation. Together, the measures in the descriptive summary describe the shape of each variable's distribution and tell a story about the variables. The boxplot displays the five-number summary. The coefficient of determination measures the proportion of variation in the dependent variable Y (Asian Population) that is explained by the variation in the independent variable X (Real Estate Value, Industrial Establishments, Crime, or Unemployment) in the regression model. Simple linear regression analysis estimates the relationship between two variables, and we conclude with it as the final component of the analysis.


Alex: The Asian population plays an important role in California. The purpose of this part is to examine its relationship with real estate prices in California.

Pear: The number of industrial establishments in California was chosen for analysis because the Asian population in California has been increasing every year. Many Asian people have come to the U.S., and they tend to invest in businesses or start their own.

Tom: Crime in California was selected for analysis alongside the increasing Asian population in California. The purpose of this part is to examine the relationship between the growing Asian population and the crime rate.

Ihsanullah: According to the available data for 2000-2010, the Asian population in California has been increasing. This part of the research focuses on the relationship between the Asian population and unemployment in California.

Independent Data: Asian Population in California


Exhibit #1

Exhibit #1 is the graph of the Asian population in California from 2000 to 2010. The population rose steadily over the period.

Dependent Data #1: Real Estate Price in California

Exhibit #2

Exhibit #2 shows the real estate price in California from 2000-01 to 2010-01. The data include only new house purchases and exclude refinancing. As can be seen, the price rose from just over $100 thousand in 2000 to a peak of $280 thousand in 2006. It then fell from $280 thousand in 2006 to $160 thousand in 2009 and stayed at almost the same level during the following year. The net effect was that the real estate price increased by nearly $50 thousand over the decade.

Dependent Data #2: Industrial Establishments in California

Exhibit #3

Exhibit #3 shows the total number of industrial establishments in California from 2000 to 2010. The data cover businesses of all sizes and types. As can be seen, the number of establishments increased from 2000 to 2007. There were about 800,000 establishments in California in 2000, and the count peaked in 2007 at nearly 900,000. The number has been decreasing since 2008, which suggests that the count of industrial establishments in California would continue to drop in the following year.

Dependent Data #3: Crime Rate in California

Exhibit #4

Exhibit #4 is the chart of crimes for all of California (in thousands) from 2000 through 2010. The data include only crimes that were confirmed and reported. As can be seen, crime increased from 1,331,514 in 2000 to a peak of 1,423,580 in 2004, and then declined to 1,238,243 in 2010.

Dependent Data #4: Unemployment in California

Exhibit #5

The chart above (Exhibit #5) shows the number of unemployed people in California by year from 2000 to 2010. The number rose and fell over the period. In 2000 the number of unemployed people in California was 833,237; it increased each year until it reached 1,191,744 in 2003, then declined each year to 872,567 in 2006. From 2007 it increased again each year, reaching 2,259,942 in 2010.

Descriptive Summary

The descriptive summary describes the central tendency and spread of each variable. Central tendency is usually described by the mean, median, and mode, and spread by the standard deviation. Kurtosis, skewness, the first quartile, the third quartile, and the interquartile range give further detail. Together, these measures describe the shape of each variable's distribution.

                      House Price     Industrial        Crime Rate       Unemployment
                      in California   Establishments    in California    in California
                                      in California
Mean                  191.15          846778.27         1356993.273      1245672.091
Median                167.37          849875            1369224          1094277
Mode                  N/A             N/A               N/A              N/A
Minimum               111.69          799863            1238243          833237
Maximum               285.20          891997            1431269          2259942
Range                 173.51          92134             193026           1426705
Variance              3410.00         926472229.4       3919777289.82    233685766023.6910
Standard Deviation    58.40           30438.01          62608.1248       483410.5564
Coeff. of Variation   30.55%                            4.61%            38.81%
Skewness              0.4077          -0.1342           -0.6908          1.5560
Kurtosis              -1.0718         -1.1166           -0.3057          1.2989
Count                 11              11                11               11
First Quartile        142.33          827472            1328826          932073
Third Quartile        251.37          878128            1417178          1332270
Interquartile Range   109.04          50656             88352            400197
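The derived rows of the descriptive table can be re-computed from its other rows. The sketch below is illustrative only: it copies the means, standard deviations, extremes, and quartiles from the table, and the dictionary keys and function name are hypothetical. It recomputes the coefficient of variation (standard deviation divided by the mean), the range, and the interquartile range. Under these inputs it reproduces the three printed percentages for house price, crime, and unemployment, and gives roughly 3.59% for industrial establishments.

```python
# Summary values copied from the descriptive table; the key names are
# hypothetical labels chosen for this sketch.
summary = {
    "house_price": {"mean": 191.15, "sd": 58.40, "min": 111.69,
                    "max": 285.20, "q1": 142.33, "q3": 251.37},
    "industrial": {"mean": 846778.27, "sd": 30438.01, "min": 799863,
                   "max": 891997, "q1": 827472, "q3": 878128},
    "crime": {"mean": 1356993.273, "sd": 62608.1248, "min": 1238243,
              "max": 1431269, "q1": 1328826, "q3": 1417178},
    "unemployment": {"mean": 1245672.091, "sd": 483410.5564, "min": 833237,
                     "max": 2259942, "q1": 932073, "q3": 1332270},
}


def coefficient_of_variation(mean, sd):
    """Return the standard deviation as a percentage of the mean."""
    return 100.0 * sd / mean


derived = {}
for name, s in summary.items():
    derived[name] = {
        "cv_percent": round(coefficient_of_variation(s["mean"], s["sd"]), 2),
        "range": round(s["max"] - s["min"], 2),   # largest minus smallest value
        "iqr": round(s["q3"] - s["q1"], 2),       # third minus first quartile
    }

for name, values in derived.items():
    print(name, values)
```

Because the coefficient of variation is unit-free, it allows the spread of the house price index to be compared with the spread of the much larger count variables.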

Skewness describes the shape of a variable's distribution. The shape can also be read from the five-number summary: the smallest value, the first quartile, the median, the third quartile, and the largest value. If skewness equals zero, the distribution is perfectly symmetrical. Otherwise, the distribution is either left-skewed (negative skewness) or right-skewed (positive skewness). The boxplot shows clearly which of these distributions a variable follows.
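The sign convention for skewness can be illustrated with a small calculation. The report does not say which skewness formula produced the table values, so the sketch below assumes the adjusted sample formula used by common spreadsheet software; the function name and the three short data sets are hypothetical and are not the project data.

```python
import statistics


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    cubed = sum(((v - mean) / sd) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed


# Hypothetical data chosen only to show the three cases.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 3, 4, 20]              # one large value stretches the right tail
left_tailed = [-v for v in right_tailed]     # mirror image stretches the left tail

for label, data in [("symmetric", symmetric),
                    ("right-tailed", right_tailed),
                    ("left-tailed", left_tailed)]:
    print(label, round(sample_skewness(data), 4))
```

A symmetric sample gives a value of zero (up to rounding), a long right tail gives a positive value, and a long left tail gives a negative value.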

#1: Real Estate Price in California

[Boxplot of Housing Price Index - Purchase Only, California, Non-Seasonally Adj HPI Index; axis from 110 to 310]

Exhibit #6

The boxplot also displays the five-number summary. The boxplot of the Housing Price Index shows a right-skewed distribution, because the skewness of 0.4077 is greater than zero. Thus, the distance from the smallest value to the median is shorter than the distance from the median to the largest value. In other words, most of the house purchasing activity occurred in the range from about 110 to 210.

#2 Industrial Establishments in California

[Boxplot of Industrial Establishments; axis from 806,730 to 896,730]

Exhibit #7

The distribution of the number of industrial establishments is approximately symmetric, which indicates that the mean is close to the median. There are few extreme values in this data set. The middle of the distribution lies between about 826,730 and 876,730. Therefore, the number of business establishments in California is approximately symmetrically distributed.

#3 Crime Rate in California

[Boxplot of Crime, all of California, 2000 - 2010; axis from 1,238,240 to 1,538,240]

Exhibit #8

The boxplot above shows the distribution of crime from 2000 through 2010 and displays the five-number summary. The distribution is negatively (left) skewed, with a skewness of -0.6908, which is less than zero. The mean was 1,356,993 crimes per year, with a minimum of 1,238,243 and a maximum of 1,431,269. The first quartile is 1,328,826 crimes, so 25% of the yearly counts fall at or below that value; the third quartile is 1,417,178, so 75% fall at or below it. The median is 1,369,224 crimes, so half of the yearly counts fall below it.

#4: Unemployment in California

[Boxplot of Non-Seasonally Adjusted Unemployment, California (number of people); axis from 833,230 to 2,433,230]

Exhibit #9

The boxplot above shows that the distribution is right-skewed. The skewness of unemployment in California is 1.5560, which is well above zero. This means the mean is greater than the median: most of the values are comparatively small, and a few exceptionally large values pull the mean toward the right side of the boxplot.
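The link between skewness and the position of the mean relative to the median can be checked against the descriptive table. The sketch below copies the mean, median, and skewness of each variable from that table; the dictionary layout and the function name are hypothetical. It only compares signs, so it is a rough consistency check rather than a formal test.

```python
# (mean, median, skewness) copied from the descriptive table.
rows = {
    "House price": (191.15, 167.37, 0.4077),
    "Industrial establishments": (846778.27, 849875, -0.1342),
    "Crime": (1356993.273, 1369224, -0.6908),
    "Unemployment": (1245672.091, 1094277, 1.5560),
}


def tail_direction(mean, median):
    """Direction suggested by comparing the mean with the median."""
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "symmetric"


agreement = {}
for name, (mean, median, skew) in rows.items():
    from_mean = tail_direction(mean, median)
    from_skew = "right" if skew > 0 else "left" if skew < 0 else "symmetric"
    agreement[name] = (from_mean, from_skew)
    print(f"{name}: mean vs median -> {from_mean}, skewness sign -> {from_skew}")
```

For all four variables the two indications agree: the mean lies above the median where skewness is positive and below it where skewness is negative.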

Correlation Coefficient Analysis

Y = Asian Population; X = Real Estate Value, Industrial Establishments, Crime, or Unemployment

The coefficient of determination measures the proportion of variation in Y (Asian

Population) that is explained by the variation in the independent variable X (Real Estate Value or

Industrial Establishments, or Crime, or Unemployment) in the regression model.
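In simple linear regression with one independent variable, the coefficient of determination is the square of the correlation coefficient. The sketch below squares the correlation values printed in this report (0.38509275 and 0.684948369 from the correlation tables, 0.7351 and 0.7877 from the Multiple R lines of the regression output) and compares the results with the R Square values the report prints. The variable names are hypothetical.

```python
# Correlation magnitudes as printed in the report.
multiple_r = {
    "real_estate": 0.38509275,
    "industrial_establishments": 0.684948369,
    "crime": 0.7351,
    "unemployment": 0.7877,
}

# R Square values as printed in the regression output.
reported_r_square = {
    "real_estate": 0.1483,
    "industrial_establishments": 0.4692,
    "crime": 0.5404,
    "unemployment": 0.6204,
}

r_squared = {name: r ** 2 for name, r in multiple_r.items()}

for name, value in r_squared.items():
    print(f"{name}: r = {multiple_r[name]:.4f}, r^2 = {value:.4f}, "
          f"reported = {reported_r_square[name]:.4f}")
```

Squaring makes the distinction concrete: a correlation of about 0.79 corresponds to about 62% of variation explained, not 79%.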

#1 Real Estate Value in California

                          Asian Population in CA   Real Estate Value in CA
Asian Population in CA    1
Real Estate Value in CA   0.38509275               1

Exhibit #10

In Exhibit #10, the correlation coefficient is approximately 0.385, which is closer to 0 than to 1. It means that the Asian population in CA did not have a strong linear relationship with real estate value in CA.

#2 Industrial Establishments in California

                            Asian Population in CA   Industrial establishments in CA
Asian Population in CA      1
Industrial establishments   0.684948369              1

Exhibit #11

The coefficient of correlation between the number of industrial establishments and the Asian population in California is about 0.68. It indicates a moderately strong positive linear relationship between these two variables, because the value is closer to 1 than to 0.

#3 Crime Rate in California

                         Asian Population in CA   Crime rate in CA
Asian Population in CA   1
Crime rate in CA         .5404                    1

Exhibit #12

The value shown in Exhibit #12, .5404, is the coefficient of determination (the square of the correlation coefficient). The corresponding correlation coefficient is -0.7351: the Multiple R of 0.7351 in Exhibit #17, with the sign of the negative slope. Therefore, 54.04% of the variation in Asian population is explained by the variation in the crime rate. At just over 50%, this indicates a moderate linear relationship between the two variables. The remaining 45.96% of the sample variability in the Asian population is due to factors other than what is accounted for by the linear regression model that uses the crime rate in California.

#4 Unemployment in California

                     Asian Population   Unemployment in CA
Asian Population     1
Unemployment in CA   0.7877             1

Exhibit #10

According to the data, there is a strong positive linear relationship between the Asian population and unemployment, as the value 0.7877 is close to 1. Squaring this value gives 0.6204, which means that 62.04% of the variation in Asian population is explained by the variation in unemployment.

Simple Linear Regression Analysis

Simple linear regression analysis is a method for estimating the relationship between two variables. For example, the two variables may be the independent variable (real estate value in California) and the dependent variable (Asian population in California). It is used to identify the mathematical relationship that exists between the dependent variable and the independent variable, and to spot unusual observations.
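The least-squares calculation behind a simple linear regression is short enough to write out. The sketch below uses the standard closed-form formulas for the slope, intercept, and correlation; the function name and the five data points are hypothetical and are not the project data, which is not reproduced in this report.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, r)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)                        # spread of x
    syy = sum((y - mean_y) ** 2 for y in ys)                        # spread of y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # co-movement
    b1 = sxy / sxx                   # slope
    b0 = mean_y - b1 * mean_x        # intercept: line passes through the means
    r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
    return b0, b1, r


# Hypothetical observations that lie close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, r = fit_line(xs, ys)
print(f"Y = {b1:.2f}X + {b0:.2f}, r = {r:.4f}, r^2 = {r * r:.4f}")
```

Spreadsheet regression tools report the same three quantities as the slope coefficient, the intercept, and Multiple R.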

#1 Real Estate Value in California and Asian Population in California

Simple Linear Regression Analysis

Regression Statistics
Multiple R          0.3851
R Square            0.1483
Adjusted R Square   0.0537
Standard Error      342342.2639
Observations        11

ANOVA
             df   SS                    MS                   F        Significance F
Regression   1    183656270483.1630     183656270483.1630    1.5671   0.2422
Residual     9    1054784030687.5600    117198225631.9520
Total        10   1238440301170.7300

            Coefficients   Standard Error   t Stat    P-value   Lower 95%      Upper 95%
Intercept   3916755.3668   369089.2889      10.6119   0.0000    3081817.3881   4751693.3454
Housing Price Index - Purchase Only California Non-Seasonally Adj HPI Index
            2320.7357      1853.8873        1.2518    0.2422    -1873.0487     6514.5202

Exhibit #13

From Exhibit #13, at the 0.05 level of significance, the p-value of the housing price index is 0.2422, which is greater than 0.05. Thus, we cannot conclude that there is a significant linear relationship between the Asian population in California and real estate value in California. To give the big picture, the scatter plot for these two variables is displayed in Exhibit #14:

Exhibit #14

[Scatter plot with fitted line f(x) = 2320.73573941192 x + 3916755.36675568, R² = 0.148296426004184; X axis from 100 to 300, Y axis from 0 to 6,000,000]

The fitted line is represented by the equation Y = AX + B, where Y is the Asian population in California (the dependent variable in the regression) and X is the real estate value (the independent variable). A is the slope, which means that for each increase of 1 unit in X, Y is estimated to change by A units. B is the constant (the intercept). Exhibit #14 shows that the equation is Y = 2320.7X + 4E+06. In other words, Y and X have a positive relationship because A = 2320.7 > 0. In addition, R² = 0.1483, which means that 14.83% of the variation in the Asian population in California is explained by the variation in real estate value in California.
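The summary statistics in the real estate regression output all follow from the sums of squares in its ANOVA table. The sketch below copies the regression and residual sums of squares and the observation count from Exhibit #13 and recomputes R Square, Adjusted R Square, the F statistic, and the standard error of the estimate using the standard formulas. The variable names are hypothetical.

```python
import math

# Values copied from the ANOVA table in Exhibit #13.
ss_regression = 183656270483.1630
ss_residual = 1054784030687.5600
n_observations = 11
n_predictors = 1

ss_total = ss_regression + ss_residual
df_residual = n_observations - n_predictors - 1   # 9 degrees of freedom

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n_observations - 1) / df_residual
ms_regression = ss_regression / n_predictors
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual
standard_error = math.sqrt(ms_residual)           # standard error of the estimate

print(f"R Square          = {r_square:.4f}")
print(f"Adjusted R Square = {adjusted_r_square:.4f}")
print(f"F                 = {f_statistic:.4f}")
print(f"Standard Error    = {standard_error:.4f}")
```

The adjusted value is much smaller than R Square here because only 11 observations are available, so the penalty for the one estimated slope is relatively large.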

#2 Industrial Establishments in California and Asian Population in California

Simple Linear Regression Analysis

Regression Statistics
Multiple R          0.6849
R Square            0.4692
Adjusted R Square   0.4102
Standard Error      270271.6223
Observations        11

ANOVA
             df   SS                    MS                   F        Significance F
Regression   1    581019552597.1430     581019552597.1430    7.9541   0.0200
Residual     9    657420748573.5840     73046749841.5093
Total        10   1238440301170.7300

            Coefficients    Standard Error   t Stat    P-value
Intercept   -2345419.9925   2379079.0304     -0.9859   0.3500
Number of Industrial establishments in California
            7.9192          2.8079           2.8203    0.0200

Exhibit #15

From this information, at the 0.05 level of significance, the p-value for the number of industrial establishments is 0.0200, which is less than 0.05. Therefore, we can conclude that there is a significant linear relationship between the Asian population and the number of business establishments in California.
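The p-value comparison is equivalent to comparing each slope's t statistic with a critical value. The sketch below divides each slope by its standard error, using the coefficients printed in Exhibits #13, #15, #17, and #19, and compares the result with the two-tailed critical value of 2.2622 that Exhibit #19 prints for 9 residual degrees of freedom at the 95% confidence level. The dictionary keys and function name are hypothetical.

```python
# Two-tailed critical t value for 9 degrees of freedom at the 0.05 level,
# as printed in the Calculations block of Exhibit #19.
T_CRITICAL = 2.2622

# (slope, standard error of the slope) copied from the regression exhibits.
slopes = {
    "housing_price_index": (2320.7357, 1853.8873),
    "industrial_establishments": (7.9192, 2.8079),
    "crime_rate": (-4.1319, 1.2703),
    "unemployment": (0.5734, 0.1495),
}


def t_statistic(coefficient, standard_error):
    """Test statistic for the hypothesis that the true slope is zero."""
    return coefficient / standard_error


results = {}
for name, (coefficient, standard_error) in slopes.items():
    t = t_statistic(coefficient, standard_error)
    results[name] = (t, abs(t) > T_CRITICAL)   # significant if |t| exceeds the cutoff
    verdict = "significant" if results[name][1] else "not significant"
    print(f"{name}: t = {t:.4f} -> {verdict} at the 0.05 level")
```

Under these inputs only the housing price index slope falls short of the cutoff, which matches its p-value of 0.2422 being above 0.05.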

[Scatter plot with fitted line f(x) = 7.91916096931599 x − 2345419.99250118, R² = 0.469154267709062; X axis from 780,000 to 900,000, Y axis from 0 to 6,000,000]

Exhibit #16

From the scatter plot, the simple linear regression equation is Y = 7.9192x - 2E+06. Thus, the Asian population increases approximately along a straight line as the number of business establishments increases. Moreover, the simple linear regression shows that r² = 0.4692, which means that 46.92% of the variation in the Asian population can be explained by the variation in the independent variable.

#3 Crime Rate in California and Asian Population in California

Simple Linear Regression Analysis

Regression Statistics
Multiple R          0.7351
R Square            0.5404
Adjusted R Square   0.4893
Standard Error      251489.9156
Observations        11

ANOVA
             df   SS                    MS                   F         Significance F
Regression   1    669215702456.9300     669215702456.9300    10.5810   0.0100
Residual     9    569224598713.7970     63247177634.8663
Total        10   1238440301170.7300

                           Coefficients   Standard Error   t Stat    P-value   Lower 95%      Upper 95%
Intercept                  9967347.0471   1725390.4055     5.7769    0.0003    6064242.7827   13870451.3115
Crime Rate in California   -4.1319        1.2703           -3.2528   0.0100    -7.0054        -1.2584

Exhibit #17

Simple linear regression analysis is a process for estimating the relationship between two variables. Here, the dependent variable Y is the Asian population in California and the independent variable X is the crime rate in California. The slope, b1, is -4.1319. This means that for each increase of 1 unit in X (crime rate), the predicted value of Y (Asian population) is estimated to decrease by 4.1319 units. Thus the slope represents the portion of the Asian population that is estimated to vary with the crime rate. The Y-intercept, b0, is 9,967,347.0471. The Y-intercept represents the predicted value of Y when X equals 0. The Y-intercept should be interpreted with caution, because a crime rate of zero is outside the range of observed values.
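The caution about the intercept can be made concrete by checking whether a given X lies inside the observed range before trusting a prediction. The sketch below treats crime as the predictor and Asian population as the response, as in the regression output of Exhibit #17, and uses the fitted coefficients printed on the scatter plot together with the minimum, maximum, and mean of crime from the descriptive table. The function names are hypothetical.

```python
# Fitted coefficients as printed in the trendline label of the crime scatter plot.
B0 = 9967347.04708142    # intercept
B1 = -4.13192438402224   # slope

# Observed minimum, maximum, and mean of crime, from the descriptive table.
X_MIN, X_MAX, X_MEAN = 1238243, 1431269, 1356993.273


def predict(x):
    """Predicted Asian population for a given number of crimes."""
    return B0 + B1 * x


def is_extrapolation(x):
    """True when x lies outside the range of crime values actually observed."""
    return not (X_MIN <= x <= X_MAX)


at_zero = predict(0)        # equals the intercept by construction
at_mean = predict(X_MEAN)   # a prediction inside the observed range

print(f"X = 0: prediction {at_zero:,.0f}, extrapolation: {is_extrapolation(0)}")
print(f"X = mean: prediction {at_mean:,.0f}, extrapolation: {is_extrapolation(X_MEAN)}")
```

The prediction at X = 0 is far outside anything the data can support, while the prediction at the mean crime level sits in the middle of the plotted points.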

[Scatter plot with fitted line f(x) = −4.13192438402224 x + 9967347.04708142, R² = 0.540369771416761; X axis from 1,200,000 to 1,450,000, Y axis from 0 to 6,000,000]

Exhibit #18

Depicted above is an example of a negative linear relationship: as X increases, the values of Y generally decrease. Here, higher crime rates go with a lower Asian population; in other words, as the Asian population increases, the crime rate decreases. This is also shown by the equation in the scatter plot, y = -4.1319x + 1E+07. The R-squared value of 0.5404 indicates a moderate linear relationship between these two variables, because the regression model explains 54.04% of the variability in the Asian population. The remaining 45.96% of the sample variability in Asian population is due to factors other than what is accounted for by the linear regression model that uses the crime rate.

#4 Unemployment in California and Asian Population in California

Simple Linear Regression Analysis

Regression Statistics
Multiple R          0.7877
R Square            0.6204
Adjusted R Square   0.5783
Standard Error      228541.2773
Observations        11

ANOVA
             df   SS                    MS                   F         Significance F
Regression   1    768360262154.6160     768360262154.6160    14.7108   0.0040
Residual     9    470080039016.1110     52231115446.2346
Total        10   1238440301170.7300

            Coefficients   Standard Error   t Stat    P-value   Lower 95%      Upper 95%
Intercept   3646070.8850   198570.6756      18.3616   0.0000    3196872.8089   4095268.9611
Unemployment in California (number of people)
            0.5734         0.1495           3.8355    0.0040    0.2352         0.9116

Calculations
b1, b0 Coefficients          0.5734              3646070.8850
b1, b0 Standard Error        0.1495              198570.6756
R Square, Standard Error     0.6204              228541.2773
F, Residual df               14.7108             9.0000
Regression SS, Residual SS   768360262154.6160   470080039016.1110
Confidence level             95%
t Critical Value             2.2622
Half Width b0                449198.0761
Half Width b1                0.3382

Exhibit #19

The figure above enables us to develop a model to predict the Asian population from the number of unemployed people in California. The slope, b1, is 0.5734. This means that for each increase of 1 unit in X (Unemployment), the value of Y (Asian Population) is estimated to increase by 0.5734. The slope represents the portion of the annual variation in Asian population that is estimated to vary with the number of unemployed people in California. The Y-intercept, b0, is 3,646,070.8850. This represents the predicted value of Y (Asian population) when X (Unemployment) equals zero.
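The 95% confidence limits in Exhibit #19 can be rebuilt from each coefficient, its standard error, and the critical t value. The sketch below uses the rounded figures printed in the exhibit, so the intercept limits differ from the printed ones by a few units; the function and variable names are hypothetical.

```python
# Values copied from Exhibit #19 (all rounded as printed there).
B1, SE_B1 = 0.5734, 0.1495                 # slope and its standard error
B0, SE_B0 = 3646070.8850, 198570.6756      # intercept and its standard error
T_CRITICAL = 2.2622                        # 9 degrees of freedom, 95% confidence


def confidence_interval(estimate, standard_error, t_critical):
    """Return (lower, upper) limits: estimate plus or minus t * standard error."""
    half_width = t_critical * standard_error
    return estimate - half_width, estimate + half_width


slope_low, slope_high = confidence_interval(B1, SE_B1, T_CRITICAL)
intercept_low, intercept_high = confidence_interval(B0, SE_B0, T_CRITICAL)

print(f"slope:     {slope_low:.4f} to {slope_high:.4f}")
print(f"intercept: {intercept_low:,.1f} to {intercept_high:,.1f}")
```

Because the slope interval lies entirely above zero, it agrees with the p-value of 0.0040: the data are consistent with a positive slope at the 95% confidence level.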

[Scatter plot with fitted line f(x) = 0.573411393551757 x + 3646070.88498874, R² = 0.620425757647152; X axis (Unemployment) from 600,000 to 2,400,000, Y axis (Asian Population) from 0 to 6,000,000]

Exhibit #20

As per the scatter plot above, the simple linear regression equation is Y = 0.5734x + 4E+06. Therefore, the Asian population increases approximately along a straight line as unemployment increases. Moreover, the simple linear regression shows that r² = 0.6204, which means that 62.04% of the variation in Asian population can be explained by the variation in unemployment.

Conclusion:

Simple linear regression analysis is a key process for estimating the relationship between two variables. The variables discussed in this research are the dependent variable Y (Asian Population in California) and the independent variables X (Real Estate, Industrial Establishments, Crime, and Unemployment). It is used to explain any relationship that may or may not exist between the dependent variable and the independent variable, along with anything unusual. 14.83% of the variation in the Asian population is explained by the variation in real estate value in California. 46.92% of the variation in the Asian population can be explained by the variation in the number of industrial establishments in California. 54.04% of the variation in Asian population is explained by the variation in the crime rate in California. 62.04% of the variation in Asian population can be explained by the variation in unemployment in California. Our numbers, research, and modeling have led us to find that real estate and industrial establishments were weak indicators of the Asian population in California, while crime rate and unemployment proved to be stronger indicators of a relationship with the Asian population in California.

References

Bureau of Labor Statistics. February 02, 2009. Employment and unemployment. Retrieved from

http://statisticaldatasets.data-planet.com/dataplanet/Datasets.html#I|1498d7f791654

Criminal Justice Information Services Division, Federal Bureau of Investigation. August 20, 2012. Crime reported: Crimes confirmed. Retrieved from http://statisticaldatasets.data-planet.com/dataplanet/Datasets.html#I|1499cb47f3973

Federal Housing Finance Agency. November 18, 2013. House price index – Purchase only. Retrieved from http://0-statisticaldatasets.data-planet.com.library.ggu.edu/dataplanet/Datasets.html#I|14972e727d98

Total Industrial Establishments from the County Business Patterns by NAICS Code (1998-2002)

Dataset-0 Total State California 2003 - 2012. (2014, September 30). Retrieved from

http://statisticaldatasets.data-planet.com/dataplanet/Datasets.html#I|14977e70e0132

Total Establishments (NAICS) (1998-2002) from the County Business Patterns by NAICS Code (1998-2002). (2014, September 30). Retrieved from http://statisticaldatasets.data-planet.com/dataplanet/datasheet/Datasheet.jsp?sessionID=77602ba6-0587-420c-841c-

0ef20a414491&viewID=TREND%7COrg5t%7C%7CTimeUnit0%7COrg5LIST

%4006075%2COrg4LIST%4006%7CUserCube39Dim0%400%7C%7C

%7Cfalse&sid=28edea0d80c150%2431a&param=7fuKsUAw6JIlRs8qM3vA1Bdiq7lCp

OL5mtnzyvW0WnrNETvC0ue4cQD5tNSYAZK_UphPOiR_JbWZT7Vy3WhQZ1fmiHF

AQJbcjmbHy_pE0wsNT4ScNu1bgTLUuyvTM33bLmudAFlZuUmYpf7xbk0yJ1g-

bMFSa01zjrutFVzMvAF2lHg0dbb64FBhOB6QfkFe6eTAJYtI337sTZPpmnhENhlD8Zq

nKfnWdPuN-


HL7x_idge5nhPcCN0Coo8rnDnSSFduuHIWDfDbntTsBSCbTOidNu8yG0SjsY6I_pdq

MB9fizUq955_Gw12TnNR23BGAQ8lsTgO6htGJp1kvbyqWSg%3D

%3D&customData=-1%7C0%7Ctrue

United States Census Bureau, November 07, 2013. Population estimates – Detail (1980 -

Present). Retrieved from

http://statisticaldatasets.data-planet.com/dataplanet/Datasets.html#I|1499cf172c575