Statistical Analysis on real estate and Asian population
By Weicheng (Alex) Guo, Natthaporn (Pear)
Loetsakulch, Ihsanullah Shagiwal, Tom Bopp
Introduction:
This analysis interprets the data we collected using basic statistics. The Asian population in
California rose steadily from 2000 to 2010. Real estate prices increased by nearly $50 thousand
over the decade. The total number of industrial establishments in California increased by about
50,000 between 2000 and 2010. Reported crimes declined by 93,271 from 2000 to 2010. The net
change in unemployment was an increase of approximately 1,430,000 people (from 833,237 in 2000
to 2,259,942 in 2010). Central tendency is usually summarized by the mean, median, and mode,
and spread by the standard deviation. Together, the measures in the descriptive summary
describe the shape of the variables and tell us a story about them. The boxplot also displays the
five-number summary. The coefficient of determination measures the proportion of variation in
the dependent variable Y (Asian Population) that is explained by the variation in the independent
variable X (Real Estate Value, Industrial Establishments, Crime, or Unemployment) in the
regression model. Simple linear regression analysis is a process for determining the relationship
between two variables. We conclude with simple linear regression as the final component in using
numbers and techniques to interpret data with basic statistics.
Alex: The Asian population plays an important role in California. The purpose of this part is
to determine its relationship with real estate prices in California.
Pear: The number of industrial establishments in California was chosen for analysis
because the Asian population in California has been increasing every year. Many Asian
people have come to the U.S., and they tend to invest in businesses or start their own businesses
here.
Tom: Crime rates in California were selected for analysis alongside the increasing Asian
population in California. The purpose of this part is to determine the relationship between
an increasing Asian population and the crime rate.
Ihsanullah: According to the available data for 2000-2010, the Asian population
has been increasing in California. This part of the research focuses on the relationship
between the Asian population and unemployment in California.
Independent Data: Asian Population in California
Exhibit #1
Exhibit #1 is the graph of the Asian population in California from 2000 to 2010. The
population rose steadily throughout the period.
Dependent Data #1: Real Estate Price in California
Exhibit #2
Exhibit #2 is the real estate price in California from 2000-01 to 2010-01. This data only
includes new purchases of houses, excluding refinancing. As can be seen, the main tendency
from 2000 to 2006 was upward, from just over $100 thousand to a peak of $280 thousand.
After that, the price fell from $280 thousand in 2006 to $160 thousand in 2009. It
remained at almost the same level during the next year. The net effect was that real estate price
increased by nearly $50 thousand throughout the decade.
Dependent Data #2: Industrial Establishments in California
Exhibit #3
Exhibit #3 shows the total number of industrial establishments in California from 2000 to
2010. This data covers all business sizes and types. As can be seen, the number of businesses
increased from 2000 to 2007. There were about 800,000 business establishments in California in
2000. The number peaked in 2007 at nearly 900,000 companies. However, the number has been
decreasing since 2008. This data suggests that the trend of industrial establishments in
California would continue to drop in the following year.
Dependent Data #3: Crime Rate in California
Exhibit #4
Exhibit #4 is the chart of crimes for all of California (in thousands) from 2000 through
2010. This data only includes crimes confirmed and reported. As can be seen, crime increased
from 1,331,514 in 2000 to a peak of 1,423,580 in 2004. Thereafter, crime declined from
1,423,580 in 2004 to 1,238,243 in 2010.
Dependent Data #4: Unemployment in California
Exhibit #5
The above chart (Exhibit #5) shows the number of unemployed people in California on a
yearly basis from 2000 to 2010. The chart shows ups and downs in the number of unemployed
people during this period. For example, in 2000 the number of unemployed people in California
was 833,237, increasing each year until it reached 1,191,744 in 2003. It then declined each
year until it reached 872,567 in 2006. From 2007, the number of unemployed people again
increased each year, reaching 2,259,942 in 2010.
Descriptive Summary
A descriptive summary measures the central tendency and spread of the variables. Central
tendency is usually described by the mean, median, and mode, and spread by the standard
deviation. Kurtosis, skewness, the first quartile, the third quartile, and the interquartile range
describe the variables in more depth. Together, the measures in the descriptive summary
describe the shape of the variables.
                      House Price     Industrial          Crime Rate in       Unemployment in
                      in California   establishments      California          California
                                      in California
Mean                  191.15          846778.27           1356993.273         1245672.091
Median                167.37          849875              1369224             1094277
Mode                  N/A             N/A                 N/A                 N/A
Minimum               111.69          799863              1238243             833237
Maximum               285.20          891997              1431269             2259942
Range                 173.51          92134               193026              1426705
Variance              3410.00         926472229.4         3919777289.82       233685766023.6910
Standard Deviation    58.40           30438.01            62608.1248          483410.5564
Coeff. of Variation   30.55%          (missing)           4.61%               38.81%
Skewness              0.4077          -0.1342             -0.6908             1.5560
Kurtosis              -1.0718         -1.1166             -0.3057             1.2989
Count                 11              11                  11                  11
First Quartile        142.33          827472              1328826             932073
Third Quartile        251.37          878128              1417178             1332270
Interquartile range   109.04          50656               88352               400197
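The derived rows of the table can be cross-checked from the other rows. The following is a minimal Python sketch that uses only the values reported above; the dictionary layout and the names `summary` and `derived` are illustrative assumptions, not part of the original analysis. It recomputes the range (maximum minus minimum), the interquartile range (third quartile minus first quartile), and the coefficient of variation (standard deviation divided by mean).

```python
# Summary values copied from the descriptive summary table above.
summary = {
    "House Price": {
        "mean": 191.15, "sd": 58.40, "min": 111.69, "max": 285.20,
        "q1": 142.33, "q3": 251.37,
    },
    "Industrial Establishments": {
        "mean": 846778.27, "sd": 30438.01, "min": 799863, "max": 891997,
        "q1": 827472, "q3": 878128,
    },
    "Crime": {
        "mean": 1356993.273, "sd": 62608.1248, "min": 1238243, "max": 1431269,
        "q1": 1328826, "q3": 1417178,
    },
    "Unemployment": {
        "mean": 1245672.091, "sd": 483410.5564, "min": 833237, "max": 2259942,
        "q1": 932073, "q3": 1332270,
    },
}


def derived(stats):
    """Recompute the rows of the table that follow from the other rows."""
    return {
        "range": stats["max"] - stats["min"],
        "iqr": stats["q3"] - stats["q1"],
        "cv_percent": 100 * stats["sd"] / stats["mean"],
    }


results = {name: derived(stats) for name, stats in summary.items()}

for name, d in results.items():
    print(f"{name}: range={d['range']:.2f}, IQR={d['iqr']:.2f}, "
          f"CV={d['cv_percent']:.2f}%")
```

The table lists only three coefficients of variation. Under this formula they correspond to house price, crime, and unemployment, and the value for industrial establishments works out to about 3.59%.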
Skewness describes the asymmetry of a variable's distribution. The boxplot displays the
distribution through the five-number summary: the smallest value, the first quartile, the median,
the third quartile, and the largest value. If skewness equals zero, the variable has a perfectly
symmetrical distribution. Otherwise, the distribution is either left-skewed (negative skewness) or
right-skewed (positive skewness). The boxplot shows clearly which kind of distribution each
variable has.
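The sign convention for skewness can be illustrated with a short Python sketch. It assumes the adjusted sample skewness formula used by common spreadsheet SKEW functions; the original analysis does not state which formula produced its values. The three small data sets are made-up illustrative values, not the study data.

```python
import math


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = sum(values) / n
    # Sample standard deviation (divides by n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


# Hypothetical illustrative data, not taken from the study.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 10]             # one large value stretches the right tail
left_skewed = [-x for x in right_skewed]    # mirror image stretches the left tail

for label, data in [("symmetric", symmetric),
                    ("right-skewed", right_skewed),
                    ("left-skewed", left_skewed)]:
    print(label, round(sample_skewness(data), 4))
```

A symmetric sample gives a skewness of zero, a long right tail gives a positive value, and a long left tail gives a negative value.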
#1: Real Estate Price in California
[Boxplot: Housing Price Index - Purchase Only, California, Non-Seasonally Adjusted HPI Index; axis from 110 to 310]
Exhibit #6
The boxplot also displays the five-number summary. The Housing Price Index has a
right-skewed distribution because its skewness, 0.4077, is greater than zero. Thus, the smallest
value is closer to the median than the largest value is. In other words, most house purchasing
activity occurred in the range from about 110 to 210.
#2 Industrial Establishments in California
[Boxplot: Industrial Establishments; axis from 806,730 to 896,730]
Exhibit #7
The number of industrial establishments is approximately symmetric, which means that the
mean is quite similar to the median. There are few extreme values in this data set. Most values
fall between about 826,730 and 876,730. Therefore, the number of business establishments in
California is approximately symmetrically distributed.
#3 Crime Rate in California
[Boxplot: Crime, all of California, 2000 - 2010; axis from 1,238,240 to 1,538,240]
Exhibit #8
The above boxplot shows the distribution of the crime rate from 2000 through 2010 and its
five-number summary. The crime rate is left-skewed (negatively skewed), with a skewness of
-0.6908, which is less than 0. The mean crime rate was 1,356,993 crimes, with a minimum of
1,238,243 and a maximum of 1,431,269 crimes. In 25% of the years the crime count was at or
below the first quartile of 1,328,826 crimes, and in 75% of the years it was at or below the
third quartile of 1,417,178. In half of the years it was at or below the median of 1,369,224
crimes.
#4: Unemployment in California
[Boxplot: Non-Seasonally Adjusted Unemployment, California (number of people); axis from 833,230 to 2,433,230]
Exhibit #9
The above boxplot shows that the distribution is right-skewed. The skewness of
unemployment in California is 1.5560, which is well above zero. This means the mean is greater
than the median. Most of the values are comparatively small, and a few exceptionally large
values pull the mean toward the right side of the boxplot.
Correlation Coefficient Analysis
In the regression models, Y = Asian Population, and X = Real Estate, Industrial Establishment,
Crime, or Unemployment.
The coefficient of determination measures the proportion of variation in the dependent variable
Y (Asian Population) that is explained by the variation in the independent variable X (Real
Estate Value, Industrial Establishments, Crime, or Unemployment) in the regression model.
#1 Real Estate Value in California
                          Asian Population in CA   Real Estate Value in CA
Asian Population in CA    1
Real Estate Value in CA   0.38509275               1
Exhibit #10
In Exhibit #10, the correlation coefficient is approximately 0.385, which is closer to 0 than
to 1. This means that Asian Population in CA had only a weak positive linear relationship with
Real Estate Value in CA.
#2 Industrial Establishments in California
                            Asian Population in CA   Industrial establishments in CA
Asian Population in CA      1
Industrial establishments   0.684948369              1
Exhibit #11
The coefficient of correlation between the number of industrial establishments and the Asian
population in California is about 0.68. This indicates a moderately strong positive linear
relationship between the two variables, because the value is closer to 1 than to 0.
#3 Crime Rate in California
                          Asian Population in CA   Crime rate in CA
Asian Population in CA    1
Crime rate in CA          .5404                    1
Exhibit #12
The value reported in Exhibit #12, .5404, is the coefficient of determination (r2) rather than
the correlation coefficient. The regression output in Exhibit #17 gives a Multiple R of 0.7351,
and the negative slope there implies a correlation of about -0.7351. Therefore, 54.04% of the
variation in Asian population is explained by the variation in crime. This is a moderate linear
relationship, since the regression model explains just over half of the variability in the Asian
population. The remaining 45.96% of the sample variability in the Asian population is due to
factors other than what is accounted for by the linear regression model that uses the crime rates
in California.
#4 Unemployment in California
                      Asian Population   Unemployment in CA
Asian Population      1
Unemployment in CA    0.7877             1
Exhibit #10
According to the data, there is a strong positive linear relationship between the Asian
population and unemployment, as the correlation of 0.7877 is fairly close to 1. Squaring this
value gives a coefficient of determination of about 0.6204, which means that 62.04% of the
variation in Asian population is explained by the variation in unemployment.
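In simple linear regression, the coefficient of determination is the square of the correlation coefficient. The sketch below squares each reported correlation to show this. The variable names are illustrative assumptions, and the crime correlation of -0.7351 is inferred from the Multiple R and the negative slope in Exhibit #17, not read from Exhibit #12.

```python
# Correlation coefficients as reported in the correlation exhibits.
# The crime value is an assumption: Multiple R (0.7351) from Exhibit #17,
# with the sign taken from the negative regression slope.
correlations = {
    "Real Estate Value": 0.38509275,
    "Industrial Establishments": 0.684948369,
    "Crime Rate": -0.7351,
    "Unemployment": 0.7877,
}

# In simple linear regression, r squared equals the coefficient of determination.
r_squared = {name: round(r ** 2, 4) for name, r in correlations.items()}

for name, value in r_squared.items():
    print(f"{name}: r = {correlations[name]:+.4f}, r^2 = {value:.4f} "
          f"({100 * value:.2f}% of variation explained)")
```

The squared values agree with the R Square figures in the regression exhibits, up to rounding of the reported correlations.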
Simple Linear Regression Analysis
Simple linear regression analysis is a method for determining the relationship between
two variables. In the regressions fitted below, the dependent variable is the Asian population in
California and the independent variable is, for example, real estate value in California. The
method is used to identify the kind of mathematical relationship that exists between the
dependent variable and the independent variable, as well as any unusual observations.
#1 Real Estate Value in California and Asian Population in California
Simple Linear Regression Analysis
Regression Statistics
Multiple R            0.3851
R Square              0.1483
Adjusted R Square     0.0537
Standard Error        342342.2639
Observations          11

ANOVA
             df   SS                     MS                     F        Significance F
Regression   1    183656270483.1630      183656270483.1630      1.5671   0.2422
Residual     9    1054784030687.5600     117198225631.9520
Total        10   1238440301170.7300

                       Coefficients   Standard Error   t Stat    P-value   Lower 95%      Upper 95%
Intercept              3916755.3668   369089.2889      10.6119   0.0000    3081817.3881   4751693.3454
Housing Price Index    2320.7357      1853.8873        1.2518    0.2422    -1873.0487     6514.5202
(Housing Price Index = Housing Price Index - Purchase Only California Non-Seasonally Adj HPI Index)
Exhibit #13
From Exhibit #13, at the 0.05 level of significance, the p-value of the housing price
index is 0.2422 > 0.05. Thus, there is insufficient evidence to conclude that a linear
relationship exists between Asian population in California and real estate value in California.
To give the big picture, Exhibit #14 displays the scatter plot for these two variables.
Exhibit #14
[Scatter plot: X = Real Estate Value (Housing Price Index), Y = Asian Population in California; fitted line f(x) = 2320.7357x + 3916755.3668, R² = 0.1483]
The fitted line is represented by the equation Y = AX + B. In this regression, Y is the Asian
Population in California and X is the Real Estate Value. A is the slope: for each increase of
1 unit in X, Y is predicted to increase by A units. B is the intercept, the predicted value of Y
when X equals 0. Exhibit #14 shows that the equation is Y = 2320.7X + 4E+06. In other words,
Y and X have a positive relationship because A = 2320.7 > 0. In addition, R2 = 0.1483, which
means that 14.83% of the variation in the Asian population in California is explained by the
variation in real estate value in California.
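The summary figures in Exhibit #13 follow from its sums of squares and coefficients, which gives a way to check them. The sketch below is illustrative only; the variable names are assumptions, and the inputs are copied from the exhibit. It recomputes R Square, Adjusted R Square, the standard error of the estimate, the F statistic, and the t statistic of the slope.

```python
import math

# Inputs copied from Exhibit #13.
n = 11                               # observations
ss_regression = 183656270483.1630
ss_residual = 1054784030687.5600
slope, slope_se = 2320.7357, 1853.8873

ss_total = ss_regression + ss_residual
df_regression = 1                    # one predictor
df_residual = n - 2                  # simple linear regression

r_squared = ss_regression / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual
ms_residual = ss_residual / df_residual
f_stat = (ss_regression / df_regression) / ms_residual
standard_error = math.sqrt(ms_residual)   # standard error of the estimate
t_stat = slope / slope_se                  # t statistic for the slope

print(f"R Square          {r_squared:.4f}")
print(f"Adjusted R Square {adjusted_r_squared:.4f}")
print(f"Standard Error    {standard_error:.4f}")
print(f"F                 {f_stat:.4f}")
print(f"t Stat (slope)    {t_stat:.4f}")
```

With a single predictor, the F statistic equals the square of the slope's t statistic, which is why both tests report the same p-value of 0.2422.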
#2 Industrial Establishments in California and Asian Population in California
Simple Linear Regression Analysis
Regression Statistics
Multiple R            0.6849
R Square              0.4692
Adjusted R Square     0.4102
Standard Error        270271.6223
Observations          11

ANOVA
             df   SS                     MS                     F        Significance F
Regression   1    581019552597.1430      581019552597.1430      7.9541   0.0200
Residual     9    657420748573.5840      73046749841.5093
Total        10   1238440301170.7300

                                                    Coefficients    Standard Error   t Stat    P-value
Intercept                                           -2345419.9925   2379079.0304     -0.9859   0.3500
Number of Industrial establishments in California   7.9192          2.8079           2.8203    0.0200
Exhibit #15
From Exhibit #15, at the 0.05 level of significance, the p-value for the number of industrial
establishments is 0.0200, which is less than 0.05. Therefore, we can conclude that there is a
significant linear relationship between Asian population and the number of business
establishments in California.
[Scatter plot: X from 780,000 to 900,000, Y from 0 to 6,000,000; fitted line f(x) = 7.9192x - 2345419.9925, R² = 0.4692]
Exhibit #16
From the scatter plot, the simple linear regression equation is Y = 7.9192x - 2E+06.
Thus, Asian population increases approximately linearly with the number of business
establishments. Moreover, the simple linear regression gives r2 = 0.4692, which means 46.92% of
the variation in Asian population can be explained by the variation in the independent variable.
#3 Crime Rate in California and Asian Population in California
Simple Linear Regression Analysis
Regression Statistics
Multiple R            0.7351
R Square              0.5404
Adjusted R Square     0.4893
Standard Error        251489.9156
Observations          11

ANOVA
             df   SS                     MS                     F         Significance F
Regression   1    669215702456.9300      669215702456.9300      10.5810   0.0100
Residual     9    569224598713.7970      63247177634.8663
Total        10   1238440301170.7300

                           Coefficients   Standard Error   t Stat    P-value   Lower 95%      Upper 95%
Intercept                  9967347.0471   1725390.4055     5.7769    0.0003    6064242.7827   13870451.3115
Crime Rate in California   -4.1319        1.2703           -3.2528   0.0100    -7.0054        -1.2584
Exhibit #17
Simple linear regression analysis is a process for determining the relationship between two
variables. Here, the dependent variable Y is the Asian Population in California and the
independent variable X is the Crime Rate in California. The fitted intercept is approximately
9,967,347 and the fitted slope is -4.1319. The analysis is used to describe whatever relationship
exists between the dependent variable and the independent variable, along with any unusual
observations. The slope, b1, is -4.1319. This means that for each increase of 1 unit in X (Crime
Rate), the predicted value of Y (Asian Population) is estimated to decrease by 4.1319 units. Thus
the slope represents the portion of the Asian population that is estimated to vary with the
number of crimes. The Y-intercept, b0, is 9,967,347.0471. The Y-intercept represents the
predicted value of Y when X equals 0. The Y-intercept should be interpreted with caution,
because X = 0 lies far outside the range of observed values.
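The caution about the intercept can be made concrete with a short sketch. It uses the coefficients from Exhibit #17 and the minimum and maximum crime counts from the descriptive summary. The function name and the `allow_extrapolation` flag are hypothetical, introduced only for illustration.

```python
# Coefficients from Exhibit #17 (Y = Asian population, X = crime count).
INTERCEPT = 9967347.0471
SLOPE = -4.1319

# Observed range of X, taken from the descriptive summary table.
X_MIN, X_MAX = 1238243, 1431269


def predict_asian_population(crimes, allow_extrapolation=False):
    """Predict Y from X with the fitted line; refuse X outside the observed range."""
    if not allow_extrapolation and not (X_MIN <= crimes <= X_MAX):
        raise ValueError(
            f"crimes={crimes} is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; the prediction would be an extrapolation"
        )
    return INTERCEPT + SLOPE * crimes


# Prediction at the mean crime count reported in the descriptive summary.
at_mean = predict_asian_population(1356993.273)
print(f"Predicted Asian population at the mean crime count: {at_mean:,.0f}")

# The intercept is the prediction at X = 0, far outside the observed range.
at_zero = predict_asian_population(0, allow_extrapolation=True)
print(f"Extrapolated value at zero crimes: {at_zero:,.0f}")
```

Within the observed range the line gives usable predictions. The value at zero crimes is only the algebraic intercept and has no practical interpretation.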
[Scatter plot: horizontal axis = Crime Rate in California (1,200,000 to 1,450,000), vertical axis = Asian Population in California (0 to 6,000,000); fitted line f(x) = -4.1319x + 9967347.0471, R² = 0.5404]
Exhibit #18
Depicted above is an example of a negative linear relationship. As X (Crime Rate)
increases, the values of Y (Asian Population) generally decrease. In other words, higher Asian
population figures are associated with lower crime figures. The equation in the scatter plot,
y = -4.1319x + 1E+07, expresses the same relationship. The R-squared value of 0.5404
indicates a moderate linear relationship between these two variables: the regression model
explains 54.04% of the variability in the Asian population. The remaining 45.96% of the sample
variability in Asian population is due to factors other than what is accounted for by the linear
regression model that uses crime rate.
#4 Unemployment in California and Asian Population in California
Simple Linear Regression Analysis
Regression Statistics
Multiple R            0.7877
R Square              0.6204
Adjusted R Square     0.5783
Standard Error        228541.2773
Observations          11

ANOVA
             df   SS                     MS                     F         Significance F
Regression   1    768360262154.6160      768360262154.6160      14.7108   0.0040
Residual     9    470080039016.1110      52231115446.2346
Total        10   1238440301170.7300

                                                Coefficients   Standard Error   t Stat    P-value   Lower 95%      Upper 95%
Intercept                                       3646070.8850   198570.6756      18.3616   0.0000    3196872.8089   4095268.9611
Unemployment in California (number of people)   0.5734         0.1495           3.8355    0.0040    0.2352         0.9116

Calculations
b1, b0 Coefficients           0.5734                 3646070.8850
b1, b0 Standard Error         0.1495                 198570.6756
R Square, Standard Error      0.6204                 228541.2773
F, Residual df                14.7108                9.0000
Regression SS, Residual SS    768360262154.6160      470080039016.1110
Confidence level              95%
t Critical Value              2.2622
Half Width b0                 449198.0761
Half Width b1                 0.3382
Exhibit #19
Exhibit #19 enables us to develop a model to predict the Asian population from the number
of unemployed people in California. The slope, b1, is 0.5734. This means that for each increase
of 1 unit in X (Unemployment), the value of Y (Asian Population) is estimated to increase by
0.5734. The slope represents the portion of the annual variation in Asian population that is
estimated to vary with the number of unemployed people in California. The Y-intercept, b0, is
3646070.8850. This represents the predicted value of Y (Asian population) when X
(Unemployment) equals zero.
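The 95% confidence limits in Exhibit #19 come from the estimate plus or minus a half width, where the half width is the t critical value times the standard error. The sketch below reproduces this from the reported figures. The function name is an illustrative assumption, and the t critical value is the rounded 2.2622 from the exhibit, so the last digits can differ slightly from the spreadsheet output.

```python
# Values copied from the Calculations panel of Exhibit #19.
T_CRITICAL = 2.2622   # 95% confidence, 9 residual degrees of freedom (as reported)

coefficients = {
    "b0 (intercept)": (3646070.8850, 198570.6756),   # (estimate, standard error)
    "b1 (slope)": (0.5734, 0.1495),
}


def confidence_interval(estimate, standard_error, t_crit=T_CRITICAL):
    """Return (lower, upper, half_width) for estimate +/- t_crit * standard_error."""
    half_width = t_crit * standard_error
    return estimate - half_width, estimate + half_width, half_width


intervals = {
    name: confidence_interval(est, se) for name, (est, se) in coefficients.items()
}

for name, (lower, upper, half_width) in intervals.items():
    print(f"{name}: half width = {half_width:.4f}, "
          f"95% CI = ({lower:.4f}, {upper:.4f})")
```

The interval for the slope excludes zero, which matches the p-value of 0.0040 being below the 0.05 level of significance.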
[Scatter plot: X (Unemployment), Y (Asian Population); fitted line f(x) = 0.5734x + 3646070.8850, R² = 0.6204]
Exhibit #20
As per the above scatter plot, the simple linear regression equation is Y = 0.5734x + 4E+06.
Therefore, Asian population increases approximately linearly with the number of unemployed
people. Moreover, the simple linear regression gives r2 = 0.6204, which means 62.04% of the
variation in Asian population can be explained by the variation in unemployment.
Conclusion:
Simple linear regression analysis is a key process for determining the relationship between
two variables. In this research, the regressions use Asian Population in California as the
dependent variable Y and Real Estate, Industrial Establishments, Crime, and Unemployment as the
independent variables X. The analysis describes whatever relationship exists between the
dependent variable and the independent variable, along with any unusual observations. 14.83% of
the variation in the Asian population is explained by the variation in real estate value in
California. 46.92% of the variation in the Asian population can be explained by the variation in
the number of industrial establishments in California. 54.04% of the variation in Asian
population is explained by the variation in crime in California. 62.04% of the variation in Asian
population can be explained by the variation in unemployment in California. Our numbers,
research, and modeling have led us to find that real estate and industrial establishments were
weak indicators of the Asian population in California, while crime rate and unemployment were
stronger indicators of a relationship with the Asian population in California.
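The four regressions can be set side by side in a short sketch. It uses the R Square and Significance F values from the regression exhibits and the 0.05 level of significance applied in the text; the tuple layout and variable names are illustrative assumptions.

```python
ALPHA = 0.05  # level of significance used throughout the analysis

# (predictor, R Square, p-value of the slope) from Exhibits #13, #15, #17, #19.
models = [
    ("Real Estate Value", 0.1483, 0.2422),
    ("Industrial Establishments", 0.4692, 0.0200),
    ("Crime Rate", 0.5404, 0.0100),
    ("Unemployment", 0.6204, 0.0040),
]

# Flag each model as significant when its p-value is below ALPHA.
summary = [(name, r2, p, p < ALPHA) for name, r2, p in models]

# Rank predictors from most to least variation explained.
ranked = sorted(summary, key=lambda row: row[1], reverse=True)

for name, r2, p, significant in ranked:
    verdict = "significant" if significant else "not significant"
    print(f"{name:<26} R^2 = {r2:.4f}  p = {p:.4f}  {verdict} at {ALPHA}")

significant_names = [name for name, _, _, significant in summary if significant]
```

By explained variation, unemployment ranks first and real estate value last. Only the real estate model fails the significance test at the 0.05 level.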
Reference
Bureau of Labor Statistics. February 02, 2009. Employment and unemployment. Retrieved from
http://statisticaldatasets.data-planet.com/dataplanet/Datasets.html#I|1498d7f791654
Criminal Justice Information Services Division, Federal Bureau of Investigation. August 20, 2012.
Crime reported: Crimes confirmed. Retrieved from
http://statisticaldatasets.data-planet.com/dataplanet/Datasets.html#I|1499cb47f3973
Federal Housing Finance Agency, November 18, 2013. House price index – Purchase only.
Retrieved from http://0-statisticaldatasets.data-planet.com.library.ggu.edu/dataplanet/
Datasets.html#I|14972e727d98
Total Industrial Establishments from the County Business Patterns by NAICS Code (1998-2002)
Dataset-0 Total State California 2003 - 2012. (2014, September 30). Retrieved from
http://statisticaldatasets.data-planet.com/dataplanet/Datasets.html#I|14977e70e0132
Total Establishments (NAICS) (1998-2002) from the County Business Patterns by NAICS Code
(1998-2002). (2014, September 30). Retrieved from http://statisticaldatasets.data-
planet.com/dataplanet/datasheet/Datasheet.jsp?sessionID=77602ba6-0587-420c-841c-
0ef20a414491&viewID=TREND%7COrg5t%7C%7CTimeUnit0%7COrg5LIST
%4006075%2COrg4LIST%4006%7CUserCube39Dim0%400%7C%7C
%7Cfalse&sid=28edea0d80c150%2431a&param=7fuKsUAw6JIlRs8qM3vA1Bdiq7lCp
OL5mtnzyvW0WnrNETvC0ue4cQD5tNSYAZK_UphPOiR_JbWZT7Vy3WhQZ1fmiHF
AQJbcjmbHy_pE0wsNT4ScNu1bgTLUuyvTM33bLmudAFlZuUmYpf7xbk0yJ1g-
bMFSa01zjrutFVzMvAF2lHg0dbb64FBhOB6QfkFe6eTAJYtI337sTZPpmnhENhlD8Zq
nKfnWdPuN-
HL7x_idge5nhPcCN0Coo8rnDnSSFduuHIWDfDbntTsBSCbTOidNu8yG0SjsY6I_pdq
MB9fizUq955_Gw12TnNR23BGAQ8lsTgO6htGJp1kvbyqWSg%3D
%3D&customData=-1%7C0%7Ctrue
United States Census Bureau, November 07, 2013. Population estimates – Detail (1980 -
Present). Retrieved from
http://statisticaldatasets.data-planet.com/dataplanet/Datasets.html#I|1499cf172c575