JAVAN MULENGA CHITEMBO
AUD16377BEC24087
COURSE NAME
APPLIED STATISTICS FOR ECONOMIC ANALYSIS
STUDENT PROFILE
THE APPLICATION OF STATISTICS IN DECISION-MAKING
LEVEL OF DEGREE
DOCTORATE
MARCH 2014
ATLANTIC INTERNATIONAL UNIVERSITY – AIU
HONOLULU HAWAII – USA
TABLE OF CONTENTS
I. Definitions
II. Introduction
III. Statistics
IV. Descriptive statistics
V. Inferential statistics
VI. Sampling and data collection
VII. Probability distribution
VIII. Confidence intervals
IX. Hypothesis testing
X. Correlation analysis
XI. Simple regression analysis
XII. Conclusion
Introduction
In developing this compendium on statistics in economics, I am particularly interested in appreciating the value of applying statistics to decision making in our daily situations. It is undeniable that statistics are all around us; gaining a sound knowledge of the mechanics of statistics moves us to a level where mankind can efficiently command resources in order to achieve optimal results.

Although statistics is one of the newer fields to have gained prominence in the academic realm over the last few decades, it has attracted significant interest from many scholars around the globe. To that effect, statistics should be viewed as a science of numeric and non-numeric data which is applicable to all known fields, and as a tool that helps us to gain a deep understanding of the dynamics of any given event.
Singpurwalla, M. (2013) defined statistics as a science of data. He further contended that it involves collecting, classifying, summarizing, analyzing, and interpreting numeric information.

Shayib, M. (2013) stated that statistics is the branch of science which deals with the collection of data, the organizing and summarizing of those data, the analysis of the data, and the making of inferences, decisions and predictions. He further argued that the last point is the objective of statistics, that is, making inferences about a population based on information contained in a representative sample taken from that population.
The learning and application of statistics in our daily situations is mirrored from two angles, which is why many pundits on this subject have distinctively dissected it into two major branches, namely descriptive statistics and inferential statistics. It is in this light that the discourse of this paper shall follow a similar paradigm.

However, before I proceed with the above, it is important that I define the terminologies commonly used in statistics, in order to provide a solid background with which the audience of this paper can easily become acquainted, since these terms shall appear often throughout the dialogue.
Population: in statistics the word population does not necessarily refer to people only, as it does in its literal meaning. Instead, in the statistical world, it refers to a set of objects or elements that share certain common properties and are of interest to our study; for example, the population of students of economics at Atlantic International University.
Sample: this is a representative subset of a population. A sample is very useful in situations where it is practically impossible to deal with all the elements in a population; for instance, a committee of five (5) students in an economics class forms a sample of that class.
Experimental unit: an individual object or element in a population or sample from which data is being collected; for instance, a student in a committee of an economics class.

Variable: the characteristic or property of each experimental unit. Continuing the example above, the variable would be economics, since the population or sample consists of students who are studying economics at AIU.
Descriptive statistics
The focus of descriptive statistics is to describe a data
set. It achieves this by exploiting numerical and graphical
methods to unearth patterns that lie in a data set. The
revealed information is summarized and presented in a very
convenient manner for other users to make educated
decisions.
In short, descriptive statistics includes both numerical
measures and graphical display of data.
Since we already know what a population is, we also know that in certain situations it is highly impractical to collect data from all experimental units in a population. When we are in such a situation, we consider a smaller yet highly representative part of that particular population we are interested in studying. This implies that we develop a sample, which becomes a subset of the population; for example, a committee of 5 students selected randomly from an economics class.
Many scholars refer to the elements in a population or sample differently while meaning the same thing. Some call them observations, measurements, scores or just data. In this paper I will call them measurements.
Shayib, M. (2013) argued that there are four levels of measurement. He described them as follows.

Nominal data: a form of measurement which does not follow any natural ordering. In other words, a statistician cannot perform arithmetic calculations on nominal data; good examples of such data are names, labels, gender or even major at college.

Ordinal data: this form of data can at least be arranged in a particular order, but still no arithmetic is performed on ordinal data.

Interval data: quite similar to ordinal data; however, subtraction may be performed on interval data, and it has no natural zero.

Ratio data: quite similar to interval data, but a natural zero exists in ratio data and division may also be performed on it.
To this end, it is important to go a step further and discuss the different types of variables, since they play a crucial role in statistical research. Recall at this juncture that I defined variables in my earlier remarks as characteristics or properties of individual measurements or experimental units in a population or sample. I must emphasize that these variables come in various values; some can be in the form of numbers while others can be in categories. It is in this respect that variables can be analyzed from two perspectives, that is, category or qualitative variables and number or quantitative variables.

Now let me explore them distinctively, beginning with qualitative variables. Before I proceed any further on this matter, I would like to make a clarification: some scholars prefer to call them qualitative data and quantitative data, while other scholars call them qualitative variables and quantitative variables. This marked difference is seen in many books; for example, Singpurwalla (2013) refers to them as variables, while Shayib, M. (2013) refers to them as data, and Lewis et al. (2005) also called them data. No one should torment their brains; they all mean the same thing. I have adopted the term variable and I shall proceed as such.
Category or qualitative variables
These are variables that cannot be measured on a natural numerical scale; they can only be listed into one or more categories. They have no order, and the best way to study qualitative variables is by the use of graphs. For example, in an attempt to know the gender balance of students at AIU, names of students can only be listed into the categories of males and females. The most recommendable graphs for describing qualitative variables are:
Pie chart
Bar graph
Pareto chart
[Figure: pie chart showing male and female students]
Quantitative variables
These represent data that can be measured on a naturally occurring scale, for example the time it takes to fly from Lusaka to Florida, the time it takes to run a kilometre, the time it takes to finish a PhD, the height of human beings, the length of a river, and so forth; all of these can be measured quantitatively on a naturally occurring scale. The most recommendable graphs to use when presenting quantitative data are:
The dot plot
The stem and leaf display
The histogram
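As a minimal sketch of how such displays might be produced outside Excel (assuming Python with matplotlib is available; the flight-time values are hypothetical and not from the text), a histogram and a simple dot plot could be drawn as follows:

    # Sketch: histogram and dot plot for a small quantitative data set.
    # The data values are hypothetical flight times in hours.
    import matplotlib.pyplot as plt

    flight_hours = [14.2, 15.1, 14.8, 16.0, 15.5, 14.9, 15.2, 15.8, 14.7, 15.3]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

    # Histogram: group the measurements into bins and count how many fall in each.
    ax1.hist(flight_hours, bins=5, edgecolor="black")
    ax1.set_title("Histogram of flight times (hours)")

    # Simple dot plot: each measurement plotted as a point along the scale.
    ax2.plot(flight_hours, [0] * len(flight_hours), "o")
    ax2.set_yticks([])
    ax2.set_title("Dot plot of flight times (hours)")

    plt.tight_layout()
    plt.show()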
The presentation of information graphically and numerically helps us to communicate our research findings in an accurate and unambiguous fashion to other users. Arithmetic operations can be carried out on quantitative variables, and this provides far more meaningful results. You can now agree that nominal and ordinal data are taken care of by qualitative variables, while interval and ratio data are measured by quantitative variables.

To take the debate further, quantitative variables are classified into two types: discrete and continuous variables.

A discrete quantitative variable assumes a finite or countable set of values; that is, a discrete variable does not take on every possible value in an interval on the real line. An example of a discrete variable would be the number of children a family can have.

A continuous variable, by contrast, is the exact opposite of a discrete one: it assumes an infinite number of values between two points in an interval on the real line. For instance, a GPA awarded on a scale from 0.0 to 4.0 is a continuous variable.
Inferential statistics

The other type of statistics is inferential statistics. When discussing descriptive statistics I noted that its major goal is to describe a data set. In inferential statistics, our ultimate goal is to draw a conclusion about a population based on a sample of experimental units from that particular population. While in descriptive statistics we mostly use numeric and graphical techniques, in inferential statistics we exploit hypothesis testing techniques.

Singpurwalla (2013) argued that although there are several similarities between descriptive statistical problems and inferential statistical problems, some differences exist; he listed the following.
The differences between inferential and descriptive statistics:

Inferential statistics: 1. Define the population of interest; 2. Select the variables; 3. Select a sample; 4. Run the test; 5. Generalize the results.

Descriptive statistics: 1. Define the problem or sample of interest; 2. Select the variables; 3. Select the table, graph or numerical method; 4. Identify patterns.
SAMPLING METHODS
Data collection and sampling
Mankind's life depends on data; every day we gather data because it is a crucial raw material input to our decision making. We make decisions about what to buy, where to go for vacation, what the traffic situation is, how the weather will be, what the effect of a government policy will be, and so forth. For all of these we need data to help us make well informed decisions.
In this section I would specifically like to address how these data are collected. Data is not information; information comes from the analysis, summarizing and evaluation of data. In short, information is interpreted data: data that has been refined is called information, and the two terms cannot be used interchangeably, so care must be taken when using them in order to preserve the distinction.
To that effect, we gather data from various sources with the use of data collection tools, such as surveys.

Data collection is classified into two categories:
Primary data collection
Secondary data collection

Thus, interrogating sources in both categories is crucial to gathering good data that, once refined, will enable us to make well informed decisions.

While the names of these categories may be misleading to many people, it is the secondary data collection exercise that is conducted first, before venturing into the primary data collection exercise.
Secondary data is data collected by other researchers for a different purpose. The advantage is that it is readily available and costs less to obtain; however, in this age of the internet, secondary data tends to be bulky, posing the danger of data overload, on top of the fact that it was generated for a different purpose. Examples of secondary data are data generated by A. Nielsen, Satch & Satch, blackdot, government departments, newspapers and so forth.
Primary data is data that has not been collected before and is being collected specifically for the purposes of the matter at hand. Here we use a survey; a survey is a way of collecting data from experimental units. Below are the means through which primary research is conducted:
Focus groups
In-depth interviews
Personal interviews: another method of collecting data, mostly used when subjects are not likely to respond to other survey methods such as online surveys. They are in-depth and very comprehensive.
Phone surveys
Internet surveys
Postal mail surveys
Sampling method
Singpurwalla (2013) said that sampling is the use of a subset of a population to represent the whole. There are several methods used in sampling exercises, some of which are discussed here. A sample is a subset of experimental units drawn from the whole population.
Sampling methods are divided into two approaches, that is, probability and non-probability sampling methods. Let us start with the probability sampling methods.

Under probability sampling, every experimental unit stands an equal chance of being sampled. There are different types of probability sampling, and these are:
Random sampling: data subjects have equal chances of being selected and sampled.
Stratified sampling: the population is subdivided into strata based on a specific characteristic, and units are sampled from each stratum.
Systematic sampling: often referred to as '1 in every kth' sampling, where every kth unit is selected.
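As a rough sketch of these three methods (in Python; the population of 100 numbered student IDs and the split into two strata are hypothetical, not taken from the text):

    # Sketch: simple random, systematic and stratified sampling on a
    # hypothetical population of 100 numbered student IDs.
    import random

    population = list(range(1, 101))   # hypothetical experimental units
    random.seed(1)                     # fixed seed so the illustration is reproducible

    # Simple random sampling: every unit has an equal chance of selection.
    simple_random = random.sample(population, 10)

    # Systematic sampling: take every kth unit after a random start ("1 in every kth").
    k = len(population) // 10
    start = random.randrange(k)
    systematic = population[start::k]

    # Stratified sampling: split the population into strata and sample from each stratum.
    strata = {"first_half": population[:50], "second_half": population[50:]}
    stratified = [unit for group in strata.values() for unit in random.sample(group, 5)]

    print(simple_random, systematic, stratified, sep="\n")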
Non-probability sampling
This method is normally used where random sampling proves not to be feasible. Likewise, there are different types of non-probability sampling; these are:
Convenience sampling
Judgment sampling
Snowball sampling
Saunders, Lewis and Thornhill (2005) highlighted the
following as sampling techniques;
SAMPLING METHODS

Probability sampling: simple random, systematic, cluster, multi-stage

Non-probability sampling: quota, purposive, snowball, self-selection, convenience

Sampling methods and types
Saunders, Lewis and Thornhill (2005) further argued that probability sampling is most commonly associated with survey-based research, where one needs to make inferences from the sample about the population in order to answer the research question. Under this approach, the chance or probability of each case being selected from the population is known and is equal for all cases. In non-probability sampling, by contrast, each case's chance or probability of being selected from the population is not known, and it is impossible to answer research questions which require statistical inferences. Thus, non-probability sampling is useful in case study research.
Measure of central tendency
This measure is applicable to quantitative data, and it seeks to measure the tendency of data to centre about certain numeric values. There are three ways of measuring central tendency: the mean, the median and the mode.
Mean (average)
The mean measure of central tendency is the sum of
measurements divided by the total number of measurements
(elements). For example, Gift Chitembo's high school results were 40, 48, 56, 62, 70, 78, 87 and 90. Their mean is (40 + 48 + 56 + 62 + 70 + 78 + 87 + 90) / 8 = 531 / 8 = 66.38; in Excel this is =AVERAGE(A1:A8). When you look at the mean, you will notice that what most people refer to as the average is in essence the mean. There can only be one mean.
Median
This is the middle number of a quantitative data set when the measurements are arranged in ascending or descending order. The Excel formula is =MEDIAN( ).
The median divides the data set in half: 50% of the measurements lie below the median and 50% above it. The median is often a better representation of what the data actually look like. If the data set consists of an even number of measurements, as in the example above, the two middle numbers are added and divided by two to get the median: for the results 40, 48, 56, 62, 70, 78, 87, 90 the median is (62 + 70) / 2 = 66. On the other hand, if the data set has an odd number of measurements it is easier, because you just take the middle number; for example, for 2, 4, 6, 8, 10 the median is 6.
The mode
This is the measurement that occurs most frequently in the data. The mode is the only measure of central tendency that must itself be a value occurring in the data. For example, consider the weekend (Saturday) advertising values, per square column centimetre, of the Zambia Daily Mail newspaper: 1990, 2000, 2246, 1981, 2340, 1960, 2240, 2246, 2246, 1958. The Excel formula =MODE(A1:A10) returns 2246, which appears three times, more than any other value.
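The same three measures can also be reproduced outside Excel. The sketch below (Python, standard library only) applies them to the two small data sets quoted above:

    # Sketch: mean, median and mode using Python's statistics module,
    # applied to the two data sets quoted in the text above.
    import statistics

    school_results = [40, 48, 56, 62, 70, 78, 87, 90]
    advertising = [1990, 2000, 2246, 1981, 2340, 1960, 2240, 2246, 2246, 1958]

    print(statistics.mean(school_results))    # 66.375, i.e. about 66.38
    print(statistics.median(school_results))  # (62 + 70) / 2 = 66
    print(statistics.mode(advertising))       # 2246, the most frequent value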
Measures of variability

Gould, R. (2014) contended that, as one of the key components of statistical thinking, variability deserves more prominence in training. Moore (1990), as presented in Gould (2014), defined variability as the examination of patterns and of striking deviations from those patterns; variation is that which is not pattern. Singpurwalla (2013) contended that variance is a key concept in statistics.

From the above arguments, we have seen that variability plays a pivotal role in statistical analysis. In this manner, the job of statisticians is to understand the factors
that cause variance and how it can be controlled. For example, in HIV research, scientists have been driven by the desire to understand why some people do not contract HIV when exposed to it while others do at first exposure, and statisticians try to explain exactly what causes these differences so that drugs may be developed that make all people resistant to HIV.
There are many ways of measuring variability, but the commonest ones are:
Variance
Standard deviation

Under the variance measure, we consider the average squared distance of each point from the mean (Singpurwalla, 2013). In other words, we can say that variance is a measure of how far the data are spread out from the average (mean).

The Excel formula to calculate the variance is =VAR.P( ).
On the other hand, the standard deviation measures how much, on average, each of the values in the distribution deviates from the centre of the distribution. In Excel the formula to calculate the standard deviation is =STDEV.P( ).
In both examples you can see that the standard deviation carries the units of the data set; it is due to this fact that people like discussing data in terms of standard deviations. To sum up on the measures of variability, the standard deviation is simply the square root of the variance. The variance, on the other hand, does not carry the units of the data (it is expressed in squared units), because it is merely a measure of how spread out the data are.
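To make the two measures concrete, a short sketch (Python standard library; the data are the advertising values used in the mode example above, so the numbers themselves carry no special significance here):

    # Sketch: population variance and standard deviation,
    # the counterparts of Excel's VAR.P and STDEV.P.
    import statistics

    advertising = [1990, 2000, 2246, 1981, 2340, 1960, 2240, 2246, 2246, 1958]

    var = statistics.pvariance(advertising)  # average squared distance from the mean
    sd = statistics.pstdev(advertising)      # square root of the variance, in the data's units

    print(var, sd)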
The application of the variance and the standard deviation in statistical analysis is very important in gaining insight into a data set. This is achieved when data are analyzed from the angle of the mean and standard deviation combined; the combination gives the statistician impetus to delve into the deeper underlying insights of the data set. Analyzing further literature on this subject, we find that the standard deviation is not immune to outliers, and when they exist the standard deviation gets inflated; notwithstanding, the standard deviation is always a non-negative number.
Probability, random sampling and distribution

Earlier in this compendium I defined inferential statistics, and under the sampling methodology I discussed probability. Here, however, I want to specifically present probability and its invaluable place in statistics. If we appreciate that the concept of variability is key to statistics, then probability is the foundation of inferential statistics.

Consequently, failure to develop a solid understanding of the concept will relegate any work we do in inferential statistics to yielding nothing. Thus, at this point I am simply delving into the intuition that lies behind probability as it relates to inferential statistics.
What is probability? It is likelihood, prospect, possibility. The probability of something occurring is the chance, possibility, likelihood or prospect of it happening. For example, the probability of an imminent market crash is its likelihood of happening. Here we see that probability is somewhat futuristic. But how can we enhance our understanding of probability, especially in light of inferential statistics?
Before I define it, it is important that I pause here and quickly introduce a new concept in line with the subject at hand: uncertainty. For us to have a sound grasp of probability, it is vital to acquaint ourselves with the concept of uncertainty. Uncertainty can be defined as the study of something we do not know. In our daily lives we all encounter uncertainty, because we assess the future differently.

So some investors would deny that there is going to be a market crash, while others would say the market crash is imminent. Since uncertainty is the study of something we do not know, this sounds rather punitive to our minds, but even the future must be quantified.
Singpurwalla (2013) added that by quantifying uncertainty we are attempting to discuss the matter in more precise terms; but what precisely does this mean? It is hard to say without putting numbers behind the statements. Scholars have most often used probability as the measure of uncertainty. To this end, I confidently argue that the definition of probability is a measure of uncertainty. Probability is the likelihood that something will happen, and it is often expressed in percentage format.
What qualifies the discussion of uncertainty is the idea behind a statistical experiment. By definition, a statistical experiment is a process or act of observation that leads to a single outcome that cannot be predicted with certainty, where each outcome is known as a sample event. The collection of all sample events is known as the sample space, which is denoted in set notation as a set containing the sample events:

S: {E1, E2, ..., En}
Then we assign a probability to each sample event in our sample space. This is done based on three factors:
Previous knowledge of the experiment
Subjectively assigning probabilities
Experimentation

For example, for the toss of a fair die the sample space is S: {1, 2, 3, 4, 5, 6}, and from previous knowledge of the experiment each sample event is assigned a probability of 1/6.
The technical side of the probability concept

Union and intersection of events

Normally events are combined in two ways, that is, unions and intersections. The union of two events A and B is the event that occurs if either A or B or both occur on a single performance of the experiment, and it is denoted A ∪ B. In this case, A ∪ B consists of all sample points that belong to A or B or both.

The intersection of A and B is the event that occurs when both A and B occur on a single performance of the experiment, denoted A ∩ B; the intersection consists of all sample points that belong to both A and B.
The additive rule of probability
This rule simply tells us how to calculate the probability of the union of two events. The rule states that the probability of the union of A and B is the sum of the probabilities of A and B minus the probability of their intersection:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Conditional probability
An example of conditional probability is: what is the probability of interest rates dropping, given that the Bank of Zambia has just cut rates? The key word here is 'given'; that is, we want the probability of some event A given that event B occurs:

P(A|B) = P(A ∩ B) / P(B)
The multiplicative rule of probability

Under this rule, we are able to solve for the probability of the intersection of two events by multiplying both sides of the conditional probability formula by the denominator, P(B):

P(B) * P(A|B) = P(A ∩ B)

P(A ∩ B) = P(B) * P(A|B)
Independence of events

The key word here is 'independence'. Events A and B are independent if the occurrence of B does not alter the probability that A occurs:

P(A|B) = P(A)

P(B|A) = P(B)

The benefit of independence of events is that the calculation of the intersection becomes much easier.
For independent events, since P(A|B) = P(A), the multiplicative rule reduces to:

P(A ∩ B) = P(B) * P(A)

In conclusion, the above rules enable us to write out a sample space and to assign probabilities to its events.
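These rules can be checked numerically on a small sample space. The sketch below (Python; a fair six-sided die is used as a hypothetical experiment, with A = "even number" and B = "at least four") verifies the additive, conditional and multiplicative rules:

    # Sketch: additive rule, conditional probability and the multiplicative rule,
    # checked on the sample space of a fair six-sided die.
    from fractions import Fraction

    sample_space = {1, 2, 3, 4, 5, 6}

    def p(event):
        # Probability of an event under equally likely outcomes.
        return Fraction(len(event), len(sample_space))

    A = {2, 4, 6}   # "an even number is rolled"
    B = {4, 5, 6}   # "at least four is rolled"

    # Additive rule: P(A U B) = P(A) + P(B) - P(A n B)
    assert p(A | B) == p(A) + p(B) - p(A & B)

    # Conditional probability: P(A|B) = P(A n B) / P(B)
    p_A_given_B = p(A & B) / p(B)
    print(p_A_given_B)   # 2/3

    # Multiplicative rule: P(A n B) = P(B) * P(A|B)
    assert p(A & B) == p(B) * p_A_given_B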
Random variables and probability distribution
I stated previously that the sample space is the collection of all sample events for a statistical experiment; however, situations exist where it is impractical for us to draw up a sample space and assign probabilities to the sample events. When we are faced with such a situation, statisticians resort to random variables in order to solve the problem.

Singpurwalla (2013) defined a random variable as one that assumes numeric values associated with the random outcomes of an experiment, where one and only one numeric value is assigned to each sample point.

There are two kinds of random variables: the discrete random variable and the continuous random variable.
Earlier in this compendium I discussed the two types of variables, but just to recap: a discrete random variable is one that assumes a finite or countable set of values, such as 0, 1, 2, 3, 4 and so forth; you can see that they are often counts. The continuous random variable is the exact opposite of the discrete random variable: it assumes an infinite, uncountable set of values. Height, weight and time are good examples of continuous random variables.

Let us go a step further and look first at the discrete random variable and later at the continuous random variable as we expound the concept of probability distribution.
1. Probability distribution of the discrete random variable
You will recall that a discrete random variable takes a countable set of values; its probability distribution specifies the probability associated with each possible value the random variable can assume. The probability distribution of a discrete random variable is therefore represented either as a table or as a graph. But our dialogue on this matter would be incomplete without a good analysis of the binomial distribution.
Binomial distribution
Before going any further, it is important to remember again that probability is the foundation for all inferential statistics. The binomial distribution is a well known discrete probability distribution that many experiments have been modeled after (Singpurwalla, 2013). There are five characteristics of a binomial experiment, and if an experiment meets them we say that it follows a binomial distribution: the experiment consists of n identical trials; each trial results in one of two outcomes (success or failure); the probability of success, p, is the same on every trial; the trials are independent; and the random variable of interest is the number of successes, x, in the n trials.

To calculate the probability of x successes we use the formula

P(X = x) = (n choose x) * p^x * (1 - p)^(n - x)

To calculate the mean and variance of a binomial probability distribution we use the formulas below:

Mean = n*p

Variance = n*p*q

Standard deviation = sqrt(n*p*q)

Here n stands for the number of trials, p for the probability of success and q = 1 - p for the probability of failure.
As you can see from the standard deviation formula, it confirms my earlier assertion that the standard deviation is the square root of the variance. Thus, by recognizing the properties of binomial experiments, we can calculate specific probabilities as well as the mean and variance (Singpurwalla, 2013). In this way, problems related to statistical experiments can effectively be dealt with by recognizing the probability distribution.
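As a small sketch of those formulas (Python standard library; the choice of 10 trials with a success probability of 0.3 is hypothetical):

    # Sketch: binomial probability, mean, variance and standard deviation.
    from math import comb, sqrt

    n, p = 10, 0.3        # hypothetical: 10 trials, probability of success 0.3
    q = 1 - p

    def binom_pmf(x):
        # P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)
        return comb(n, x) * p**x * q**(n - x)

    print(binom_pmf(3))    # probability of exactly 3 successes
    print(n * p)           # mean
    print(n * p * q)       # variance
    print(sqrt(n * p * q)) # standard deviation = square root of the variance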
Normal distribution
This is another famous statistical distribution in use among statisticians. Whereas the binomial distribution is a discrete probability distribution, the normal distribution is a continuous probability distribution; for example, the height of Lola or the level of hemoglobin in blood are normally distributed. You can therefore deduce that the normal distribution applies to random variables that are measurable but not countable; for instance, one cannot count height, or the level of hemoglobin, as 0, 1, 2, 3, 4 in the way successes are counted in the binomial probability distribution. In the discrete discourse I said we can calculate the mean, variance and standard deviation; a normal distribution is likewise defined by its own mean and standard deviation.
Further questions can be posed here, for instance: how can normal probabilities be calculated using the normal distribution? It is important to understand that the standard normal distribution, which is the distribution most used for inference in statistics, is very helpful in solving the problems that arise when using normal distributions to compute normal probabilities. The standard normal distribution has a mean of 0 and a standard deviation of 1. Most normal distributions, on the other hand, do not have a mean of 0 or a standard deviation of 1; worry not, because by standardizing the variable this problem is taken care of.

To this end, recognizing the probability distribution is extremely helpful in solving problems related to statistical experiments.
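To illustrate standardizing (a sketch; the mean of 170 cm and standard deviation of 8 cm for heights are hypothetical), Python's statistics.NormalDist can play the role of the standard normal table:

    # Sketch: solving a normal probability by standardizing to the standard normal N(0, 1).
    from statistics import NormalDist

    mu, sigma = 170.0, 8.0        # hypothetical: heights with mean 170 cm, sd 8 cm
    x = 180.0

    z = (x - mu) / sigma          # the standardized value
    standard_normal = NormalDist(0, 1)
    print(standard_normal.cdf(z)) # P(height <= 180 cm), about 0.894

    # The same answer without standardizing, for comparison:
    print(NormalDist(mu, sigma).cdf(x))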
The sampling distribution
Let us first deal with two concepts here in order to put the discussion into perspective: a parameter and a sample statistic.

A parameter is defined as a numerical descriptive measure of a population. Because its calculation would require observing the whole population, its value is usually unknown and we therefore estimate it.

A sample statistic is a numerical descriptive measure of a sample that is calculated from the observations in the sample, for instance the sample mean or variance.

To that effect, we can say that the sampling distribution of a sample statistic, calculated from a sample of n measurements, is the probability distribution of that statistic.
This discourse takes us to yet another important matter, the sampling distribution of the mean of a data set. In order for us to build a strong grasp of the concept, it is important that we first appreciate its properties.

The first property states that if a random sample of n observations is selected from a population with a normal distribution, the sampling distribution of x-bar is a normal distribution.

The second states that for a sufficiently large (n > 30) random sample, the sampling distribution of x-bar will be approximately normal, with mean mu and standard error sigma/sqrt(n). The larger the sample size, the better the normal approximation to the sampling distribution of x-bar.
The central limit theorem is crucial here, as it is throughout statistics. We often sample from a population whose underlying distribution we do not know; the central limit theorem states that if the sample size is larger than about 30, the sampling distribution of the mean is approximately normal.
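A quick simulation makes this visible (a sketch; the exponential population and the sample size of 40 are hypothetical choices). The means of repeated samples cluster around the population mean with a spread close to sigma/sqrt(n):

    # Sketch: the central limit theorem by simulation. Draw many samples of size n
    # from a skewed (exponential) population and inspect the sample means.
    import random
    import statistics

    random.seed(0)
    n, repetitions = 40, 2000   # n > 30, per the rule of thumb in the text

    sample_means = [
        statistics.mean(random.expovariate(1.0) for _ in range(n))
        for _ in range(repetitions)
    ]

    print(statistics.mean(sample_means))   # close to the population mean mu = 1
    print(statistics.stdev(sample_means))  # close to sigma / sqrt(n) = 1 / sqrt(40), about 0.158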
Confidence intervals
In statistics, the level of confidence we can place in the information we generate is crucial to decision making. In the previous dialogue we learnt that the population distribution has a mean and a standard deviation and that, for a sufficiently large sample, the sampling distribution of x-bar has its own mean and standard deviation (the standard error).
The idea behind the confidence interval is that instead of estimating the parameter by a single point estimate, a whole interval of likely estimates is given. The confidence level gives the probability that the interval will capture the true parameter value in repeated samples. So, for a given sample mean, standard deviation, sample size and specified alpha level, we can calculate the confidence interval in Excel. The calculation gives us the margin of error, which we add to or subtract from the sample mean to get the confidence interval. Furthermore, the confidence level is the confidence coefficient expressed as a percentage.

The proper interpretation of a confidence interval includes the idea that the numbers were arrived at by a process that gives the correct result 95% of the time; key to this is the philosophy of a repeated process.
In the event that you have a sample size of less than 30, it simply means that you have less information upon which to base your conclusions about the mean. In this case the central limit theorem does not suffice, and a normal distribution cannot be relied upon to estimate the margin of error. When faced with such situations, the t-distribution is the answer for negotiating this statistical hiccup.
The t-distribution
In conclusion, a confidence interval estimates a population parameter. It consists of a point estimate together with some measure of reliability, that is, the estimate plus or minus the margin of error.
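As a sketch of the calculation for a small sample (the ten weight-loss values are hypothetical, and the t critical value is taken from scipy, which is an assumption about the libraries available):

    # Sketch: 95% confidence interval for a mean when n < 30, using the t-distribution.
    import statistics
    from scipy.stats import t   # assumes scipy is installed

    sample = [2.1, 1.8, 2.4, 1.9, 2.2, 2.0, 2.3, 1.7, 2.5, 2.1]  # hypothetical weight losses (lbs)
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)          # sample standard deviation

    t_crit = t.ppf(0.975, df=n - 1)       # two-sided 95% critical value
    margin_of_error = t_crit * s / n**0.5

    # Point estimate plus or minus the margin of error:
    print(xbar - margin_of_error, xbar + margin_of_error)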
Hypothesis testing
What is a statistical hypothesis? A statistical hypothesis is a belief about a population parameter, such as its mean, proportion or variance. A hypothesis test is thus a procedure which is applied to test that belief. For example, we may believe that the average weight loss for a population of dieters is equal to 2 lbs in one week; we need to run a hypothesis test to assess whether this is true or false.
Although the calculation steps for hypothesis testing are much the same as those for a confidence interval, there is a big difference in what we want to accomplish with hypothesis testing as opposed to a confidence interval. With a confidence interval we do not possess any prior knowledge of what the population parameter is, and we rely on the data to estimate it, while in a hypothesis test we use the data to confirm or refute an a priori belief about it.
The null hypothesis and the alternative hypothesis
For every hypothesis test there is a null hypothesis and an alternative hypothesis. The null hypothesis is a statement about a population parameter being equal to some value; it represents that which is being tested. It is worth mentioning that the null hypothesis always carries an equality sign, to underline the belief that you want to test.
On the flip side of the coin is the alternative hypothesis: if the null hypothesis turns out to be false, the alternative hypothesis is accepted. In other words, the alternative hypothesis is the exact opposite of the null hypothesis. The alternative hypothesis carries a sign such as <, > or the not-equal sign, therefore care must be applied when stating your conclusions in terms of the alternative hypothesis.
One or two sided hypotheses

The above hypotheses arise as follows. If the statistician has no a priori expectation about the outcome of the test, or no concern as to whether the parameter is higher or lower than the hypothesized value, the result is a two sided hypothesis, and the alternative carries the not-equal sign. But when you are interested in knowing whether the parameter is higher or lower, the result is a one sided hypothesis, and the alternative will carry a sign such as < or >.
The test statistic

So far it appears that there is no end to statistical creativity: there is another very important component of hypothesis testing, the test statistic, which is based on the statistic that estimates the parameter. For the mean with known standard deviation it is

z = (x-bar - mu) / (sigma / sqrt(n))

and this random variable has the standard normal distribution N(0, 1).
The P-value

This is another very important concept for our understanding of hypothesis testing in statistics: calculating and interpreting the p-value. The p-value quantifies the chance of obtaining a particular random sample result if the null hypothesis were true. It is a way of assessing how believable the null hypothesis is, given the evidence provided by the sample mean.
A small p-value implies that random variation due to sampling is not likely to account for the observed difference; in that case we reject the null hypothesis. But how small is small? In many cases, a p-value of 0.05 or less is considered significant. So if the p-value is equal to or less than the significance level alpha, then we reject the null hypothesis in favor of the alternative hypothesis. If the p-value is greater than the significance level, then we fail to reject the null hypothesis.
Hypothesis of the mean
How do we test a hypothesis about the mean, based on a simple random sample of size n from a normal population with unknown mean and known standard deviation? Here we rely on the sampling distribution of the mean. The p-value is the area under the sampling distribution for values at least as extreme, in the direction of the alternative hypothesis, as that of our random sample. In this case, we first use the test statistic to calculate the z-value and then use Excel to find the p-value associated with that value.
The following are the steps for conducting a hypothesis test:
State the null and alternative hypotheses
State a significance level
Calculate a test statistic
Find the p-value for the observed data
State your conclusion based on comparing the significance level to the p-value
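Following those steps for the weight-loss example used earlier (a sketch with hypothetical sample figures; the z-test assumes the population standard deviation is known):

    # Sketch: two-sided z-test for a mean, following the steps listed above.
    from statistics import NormalDist

    # 1. Hypotheses: H0: mu = 2.0 lbs of weight loss, Ha: mu != 2.0
    mu0 = 2.0
    # 2. Significance level
    alpha = 0.05
    # Hypothetical sample summary and known population standard deviation
    xbar, sigma, n = 2.3, 0.8, 36
    # 3. Test statistic: z = (xbar - mu0) / (sigma / sqrt(n))
    z = (xbar - mu0) / (sigma / n**0.5)
    # 4. Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # 5. Conclusion
    print(z, p_value, "reject H0" if p_value <= alpha else "fail to reject H0")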
There is a difference between practical significance and statistical significance. Statistical significance only shows whether or not the observed effect is due to random sampling; it does not mean that the effect is practically important. With a large enough sample size, statistical significance can be reached even for the tiniest effects.
Correlation and regression
These are two related ways of studying the relationship between two variables. The most basic way of doing this is to use a scatter plot, observe the overall pattern and study the deviations from that pattern. Put differently, it is the study of the relationship between quantitative variables measured on the same individuals. Here we are referring to dependent and independent variables.

In a scatter plot the independent variable is plotted on the x-axis and the dependent variable is plotted on the y-axis, and every individual in the data appears as a point in the plot. Once this is achieved, our interest is to study the relationship between the two, that is, how one is affected by changes in the other.

The dependent variable is defined as the one which represents the effect being tested.
The independent variable represents the input to the dependent variable, or the variable that can be manipulated to see whether it is the cause.

The use of a scatter plot helps us to understand trends in the data and also to pinpoint those points that do not obey the pattern. Furthermore, with the scatter plot we can identify the form, direction and strength of the association between the two variables. The graph helps us to answer questions such as:
Is the form linear or curved?
Is there no pattern at all?
Is the direction positive or negative, or is there no direction?
How closely do the points fit the pattern?
Correlation
Correlation is denoted r and is the most widely accepted measure of association between two variables. In this respect, correlation can be defined as a measure of the direction and strength of the linear relationship between two quantitative variables. Correlation always lies between -1 and 1; from this one can quickly deduce that a negative value signals that the association between the variables is negative, and likewise a positive value means that the association is positive.

The Excel formula is =CORREL(x, y), where x and y are the arrays of data for which you wish to calculate the correlation. Correlation is not concerned with which variable is dependent and which is independent, so in Excel it is not necessary to belabor the distinction. When, as one variable grows, the other one also grows, the relationship is positive, and the more closely the points follow a line, the stronger it is.
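The same coefficient can be computed outside Excel. A sketch (Python 3.10 or later; the x and y arrays are hypothetical, for instance hours studied against exam scores):

    # Sketch: Pearson correlation coefficient r for two hypothetical variables.
    from statistics import correlation   # available in Python 3.10+

    x = [1, 2, 3, 4, 5, 6, 7, 8]          # e.g. hours studied (hypothetical)
    y = [52, 55, 61, 60, 68, 70, 75, 79]  # e.g. exam score (hypothetical)

    r = correlation(x, y)
    print(r)   # close to +1: a strong positive linear association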
Simple linear regression
We have seen that correlation only tells us about the direction and strength of the linear relationship between two quantitative variables. But it is also important to understand how the two variables vary together; for example, at what rate does one decrease as the other increases? Linear regression is one way we can quantify this.

Here, drawing up a linear regression line is very important. The line describes how the dependent variable changes with respect to the independent variable. Remember that the distinction between dependent and independent variables was not a concern of correlation, but here the two variables are extremely crucial. It is with this line that statisticians are able to explain the changes in the dependent variable in terms of the independent variable, and also to predict the value of the dependent variable for a given value of the independent variable.
In conclusion, the intuition behind the regression line is that it attempts to quantify the relationship by putting a straight line through the data in a way that minimizes the sum of the (squared) vertical distances between the points of the data set and the fitted line; the best line is the one that minimizes these distances. The key to regression is the proper interpretation of the slope and the intercept of the line. The intercept is interpreted as the value the dependent variable would take if there were no effect of the independent variable, and the slope as the change in the dependent variable for a one-unit change in the independent variable.
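A sketch of fitting the least-squares line (Python 3.10 or later, standard library; the data are the same hypothetical arrays as in the correlation example):

    # Sketch: simple linear regression by least squares
    # (the line minimizing the sum of squared vertical distances).
    from statistics import linear_regression   # available in Python 3.10+

    x = [1, 2, 3, 4, 5, 6, 7, 8]          # independent variable (hypothetical)
    y = [52, 55, 61, 60, 68, 70, 75, 79]  # dependent variable (hypothetical)

    slope, intercept = linear_regression(x, y)
    print(slope)                  # change in y for a one-unit change in x
    print(intercept)              # predicted y when x = 0
    print(intercept + slope * 10) # prediction of y for x = 10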