Applied Statistics En

APPLIED STATISTICS: EXAMPLES IN EXCEL AND SPSS


CONTENTS

I. Descriptive statistics
  What is Statistics?
  Scales of measurement
  Discrete and continuous variables
  Data collecting
    Census
    Sampling
  Types of sample
    Simple random sample
    Stratified sample
    Cluster sampling
    Quota sampling
    Systematic sampling
  Calculating a Sample Size
  Frequency distribution
    Class intervals
    Outliers
  Data presentation: tables, diagrams and graphs
  Descriptive statistics
    Measures of central tendency
    Measures of dispersion
    Shape of distribution
      Symmetry or skewness
      Kurtosis
      Modality
    Measure of concentration

II. Empirical versus appropriate theoretical distributions (approximations with binomial, Poisson, hypergeometric or normal distribution)
  BINOMIAL DISTRIBUTION
    Probability distribution of a binomial random variable
    Characteristics of the Binomial distribution
  POISSON DISTRIBUTION
    Probability distribution of Poisson random variable
    Characteristics of the Poisson distribution
  HYPERGEOMETRIC DISTRIBUTION
  NORMAL DISTRIBUTION
    Roles for standardized normal distribution
    Characteristic intervals for normal distribution
  STUDENT t-DISTRIBUTION
  CHI-SQUARE (χ²) DISTRIBUTION
  F DISTRIBUTION
  LOGNORMAL DISTRIBUTION
  EXPONENTIAL DISTRIBUTION
  GAMMA DISTRIBUTION
  APPROXIMATIONS FOR BINOMIAL, POISSON AND HYPERGEOMETRIC DISTRIBUTION WITH NORMAL DISTRIBUTION

III. Inferential statistics: Estimation theory and hypothesis testing
  INFERENCE
  THE DISTRIBUTION OF THE SAMPLE MEANS
  CONFIDENCE INTERVAL FOR THE POPULATION MEAN
    Standard deviation from population is known
    Standard deviation from population isn't known
  CONFIDENCE INTERVAL FOR THE POPULATION PROPORTIONS
  CONFIDENCE INTERVAL FOR VARIANCE IN POPULATION
  HOW TO DETERMINE SAMPLE SIZE ACCORDING TO SAMPLE ERROR?
    Determining sample size for estimating population mean
    Determining sample size for estimating population proportion
  HYPOTHESIS TESTING
    Regions of rejection and non-rejection
    Risks in decision making process
    Procedure for hypothesis testing
    Hypothesis for the mean
      σ known
      σ unknown, small sample
      σ unknown, large sample
    A two sample test for mean
    A two sample test for variances
    Testing differences between arithmetic means of more than two populations on the basis of their samples - analysis of variance (ANOVA)
    Chi-square (χ²) test
      Test for differences between proportion for populations
      Test adequacy of approximations (goodness of fit)
    Kolmogorov-Smirnov test

IV. REGRESSION AND CORRELATION ANALYSIS
  Aim
  Basic aspects
  Scatter plot
  Line of Best Fit (Regression Line)
  The Correlation Coefficient
  The Coefficient of Determination
  Interpretation of the size of a correlation
  The standard error of estimate and the correlation coefficient
  Calculating the Equation of the Regression Line for two variables
  Prediction or forecasting
  Spearman's rank correlation coefficient
  Statistical testing (t test, ANOVA)
  Overview example for simple regression model with SPSS
  MULTIPLE REGRESSION MODEL
    The general multiple regression model
  Measures for quality of multiple regression model
  Statistical test (t test, ANOVA)
  Indicator "dummy" variables
    Simple model with "dummy" variable
    Example: indicator variables as the regression variables in the simple model with a "dummy" variable
    Example of multiple regression models with an indicator variable as one explanatory variable and a continuous variable as another explanatory variable
  CONDITIONS FOR ECONOMETRIC MODELS
    Assumptions of regression models through SPSS
      MULTICOLLINEARITY
      OUTLIERS
      NORMALITY
      AUTOCORRELATION
      HETEROSKEDASTICITY
    ECONOMETRIC CONDITIONS FOR REGRESSION MODELS WITH SPSS EXAMPLES

References

DESCRIPTIVE STATISTICS: EXAMPLES IN EXCEL


I. Descriptive statistics

What is Statistics?

Statistics, in short, is the study of data. It includes:

- Descriptive statistics (the study of methods and tools for collecting data, and mathematical models to describe and interpret data), and

- Inferential statistics (the systems and techniques for making probability-based decisions and accurate predictions based on incomplete (sample) data).

The three main aspects of statistical work with data are:

1. The collection of qualitative or numerical data,
2. The presentation of qualitative or numerical data, and
3. The analysis of numerical data with appropriate statistical methods and models.

Scales of measurement

Each scale of measurement corresponds to a particular type of data.

1. Nominal scale

A nominal scale classifies data into distinct categories in which no ordering is implied. Nominal variables can be used to identify different attributes. For example, a nominal scale is appropriate for:

- Gender
- Citizenship
- The Internet provider that you prefer
- The license plate number of a car

The only comparisons that can be made between variable values are equality and inequality. There are no "less than" or "greater than" relations among them, nor operations such as addition or subtraction.

2. Ordinal scale

An ordinal scale classifies data into distinct categories in which an ordering is implied. It is directly connected with ranking. An example is product satisfaction, where a respondent can be very satisfied, satisfied, neutral, unsatisfied or very unsatisfied.

Comparisons of better and worse can be made, in addition to equality and inequality. However, operations such as conventional addition and subtraction are still without meaning. While the categories can be ranked from high to low, the difference between points cannot be quantified. We cannot say that a person who thinks facilities are good regards them as twice as good as a person who thinks they are below average.

3. Ratio scale

A ratio scale is an ordered scale in which differences between measurements are meaningful and the measurement involves a true zero point (height, consumption, profit, etc.). All mathematical operations are possible with this type of data and lead to meaningful results. There are numerous methods for analyzing this type of data.

4. Interval scale

The most important characteristic of an interval scale is that the measurement does not involve a true zero point. The numbers have all the features of ordinal measurement and are also separated by the same interval. The zero value is arbitrary, not real (for example, temperature in degrees Celsius).

In this case, differences between arbitrary pairs of numbers can be meaningfully compared. Operations such as addition and subtraction are therefore meaningful. However, because the zero point on the scale is arbitrary, ratios between numbers on the scale are not meaningful, so operations such as multiplication and division cannot be carried out. On the other hand, negative values on the scale can be used.

Categorical variables (attributes) are connected with the nominal or ordinal scale, while numerical variables are connected with the ratio or interval scale.

Discrete and continuous variables

A numerical variable can be discrete or continuous:

- Discrete variables produce numerical responses that arise from a counting process. An example of a discrete numerical variable is the number of magazines subscribed to. Another example is the score given by a judge to a gymnast in competition: the range is 0 to 10 and the score is always given to one decimal (e.g., a score of 8.5). The response is one of a countable set of separate values, so a discrete variable cannot take every value in an interval.

- Continuous variables produce numerical responses that arise from a measuring process. The response can take any value within a continuum or interval, depending on the precision of the measuring instrument. Examples of continuous variables are distance, age, height, consumption, revenue, loan amount, export/import, etc.

Data collecting

Depending on the scope of research, data can be collected from a whole population or from a part of the population (a sample).



Census

A survey of a whole population is called a census. A census refers to data collection about every unit in a group or population. If you collected data about the height of everyone in your class, that would be regarded as a class census. A characteristic of a population (such as the population mean) is referred to as a parameter.

There are various reasons why a census may or may not be chosen as the method of data collection:

Census data

Advantages (+)
- Sampling variance is zero: There is no sampling variability attributed to the statistic because it is calculated using data from the entire population.
- Detail: Detailed information about small sub-groups of the population can be made available.

Disadvantages (-)
- Cost: In terms of money, conducting a census for a large population can be very expensive.
- Time: A census generally takes longer to conduct than a sample survey.
- Control: A census of a large population is such a huge undertaking that it is difficult to keep every single operation under the same level of scrutiny and control.

Sampling

A sampling frame is a complete or partial listing of the items comprising the population. The frame can be a data source such as population lists, directories or maps. Samples are drawn from this frame. If the frame is inadequate because certain groups of individuals or items in the population were not properly included, then the samples will be inaccurate and biased.

The sampling process comprises several stages:

- Defining the population of concern,
- Specifying a sampling frame, a set of items or events possible to measure,
- Specifying a sampling method for selecting items or events from the frame,
- Determining the sample size,
- Implementing the sampling plan,
- Sampling and data collecting,
- Reviewing the sampling process.

Examples of sample surveys:

- Phoning the fifth person on every page of the local phonebook and asking them how long they have lived in the area.

- Selecting several cities in a country, several neighbourhoods in those cities and several streets in those neighbourhoods to recruit participants for a survey.



A characteristic of a sample (such as the sample standard deviation) is referred to as a statistic.

Reasons one may or may not choose to use a sample survey include:

Sample survey

Advantages (+)
- Cost: A sample survey costs less than a census because data are collected from only part of a group.
- Time: Results are obtained far more quickly for a sample survey than for a census. Fewer units are contacted and less data needs to be processed.
- Control: The smaller scale of this operation allows for better monitoring and quality control.

Disadvantages (-)
- Sampling variance is non-zero: The data may not be as precise because they come from a sample of a population, instead of the total population.
- Detail: The sample may not be large enough to produce information about small population sub-groups or small geographical areas.

Types of sample

Simple random sample

A simple random sample is selected so that every possible sample of the same size has an equal chance of being selected from the population. Each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen at any stage during the sampling process.

In small populations such sampling is typically done without replacement. This means that a person or item, once selected, is not returned to the frame and therefore cannot be selected again. An unbiased random selection of individuals is important so that, in the long run, the sample represents the population. However, this does not guarantee that a particular sample is a perfect representation of the population.

Although simple random sampling can be conducted with replacement instead, this is less common and would normally be described more fully as simple random sampling with replacement. This means that a person or item, once selected, is returned to the frame and therefore can be selected again with the same probability $1/N$.

Advantages are that a random sample is free of classification error and it requires minimum advance knowledge of the population. Random sampling best suits situations where not much information is available about the population and data collection can be efficiently conducted on randomly distributed items.
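The difference between sampling without and with replacement can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `simple_random_sample`, the frame of 100 numbered units and the sample size of 10 are hypothetical and do not come from the text.

```python
import random

def simple_random_sample(frame, n, replace=False, seed=None):
    """Draw n items from the sampling frame, each with equal probability."""
    rng = random.Random(seed)
    if replace:
        # With replacement: a selected item returns to the frame,
        # so every draw picks each item with probability 1/N.
        return [rng.choice(frame) for _ in range(n)]
    # Without replacement: a selected item cannot be drawn again.
    return rng.sample(frame, n)

frame = list(range(1, 101))  # hypothetical frame of N = 100 unit IDs
without = simple_random_sample(frame, 10, seed=1)
with_repl = simple_random_sample(frame, 10, replace=True, seed=1)
```

Without replacement the ten selected units are always distinct; with replacement the same unit may appear more than once.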



Stratified sample

When sub-populations vary considerably, it is advantageous to sample each subpopulation (stratum) independently. Stratification is the process of grouping members of the population into relatively homogeneous subgroups before sampling.

The strata should be mutually exclusive: every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded. Then random or systematic sampling is applied within each stratum. This often improves the representativeness of the sample by reducing sampling error.

In general, the size of the sample in each stratum is taken in proportion to the size of the stratum. This is called proportionate allocation. If the population consists of 60% in the male stratum and 40% in the female stratum, then the relative size of the two samples (three males, two females) should reflect this proportion.
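Proportionate allocation can be sketched as follows. The function name `proportionate_allocation`, the stratum sizes of 600 and 400 (chosen only to reproduce the 60%/40% split) and the largest-remainder rounding rule are assumptions made for this illustration; the text itself does not prescribe a rounding rule.

```python
def proportionate_allocation(stratum_sizes, n):
    """Split a total sample size n across strata in proportion to stratum size."""
    total = sum(stratum_sizes.values())
    # Exact (possibly fractional) share of the sample for each stratum.
    exact = {name: n * size / total for name, size in stratum_sizes.items()}
    # Round down first, then hand out the remaining units to the strata
    # with the largest fractional remainders so the parts sum to n.
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = n - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc

# Hypothetical population: 600 males and 400 females, total sample of 5.
allocation = proportionate_allocation({"male": 600, "female": 400}, 5)
```

With these inputs the allocation is three males and two females, matching the proportions described above.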

Cluster sampling

The problem with random sampling methods, when we have to sample a population that is dispersed across a wide geographic region, is that we have to cover a lot of ground geographically in order to reach each of the sampled units. It is for precisely this problem that cluster or area random sampling was invented.

In cluster sampling, we follow these steps:
- divide the population into clusters (usually along geographic boundaries),
- randomly sample clusters,
- measure all units within the sampled clusters.

Cluster samples are generally used if:
- No list of the population exists.
- Well-defined clusters, which will often be geographic areas, exist.

Often the total sample size must be fairly large to enable cluster sampling to be used effectively.
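The three steps of cluster sampling can be sketched in Python as follows. The cluster names, the household IDs and the helper `cluster_sample` are hypothetical and serve only to illustrate the procedure.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """One-stage cluster sample: pick whole clusters at random, keep all their units."""
    rng = random.Random(seed)
    # Step 2: randomly sample clusters (without replacement).
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Step 3: measure every unit within each sampled cluster.
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Step 1: the population divided into clusters
# (hypothetical districts with household IDs).
clusters = {
    "north": [1, 2, 3],
    "south": [4, 5],
    "east": [6, 7, 8, 9],
    "west": [10],
}
chosen, units = cluster_sample(clusters, 2, seed=42)
```

Because clusters differ in size, the number of units in the final sample depends on which clusters happen to be drawn.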

Quota sampling

Quota sampling is the non-probability equivalent of stratified sampling. As in stratified sampling, the researcher first identifies the strata and their proportions as they are represented in the population. Then convenience or judgment sampling is used to select the required number of subjects from each stratum. This differs from stratified sampling, where the strata are filled by random sampling.

There are two types of quota sampling: proportional and non-proportional. In proportional quota sampling you want to represent the major characteristics of the population by sampling a proportional amount of each. For instance, if you know the population has 40% women and 60% men, and you want a total sample size of 100, you will continue sampling until you get those percentages and then you will stop.

Non-proportional quota sampling is a bit less restrictive. In this method, you specify the minimum number of sampled units you want in each category. Here, you are not concerned with having numbers that match the proportions in the population. Instead, you simply want to have enough to ensure that you will be able to talk about even small groups in the population.

Systematic sampling

Systematic sampling is a statistical method involving the selection of every k-th element from a sampling frame, where k, the sampling interval, is calculated as:

k = population size (N) / sample size (n)

Using this procedure, each element in the population has a known and equal probability of selection. This makes systematic sampling functionally similar to simple random sampling. It is, however, much more efficient and much less expensive to carry out. The researcher must ensure that the chosen sampling interval does not hide a pattern. Any pattern would threaten randomness. A random starting point must also be selected.

Systematic sampling should be applied only if the given population is logically homogeneous, because the units of a systematic sample are uniformly distributed over the population.
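A minimal sketch of systematic sampling, assuming the sampling interval is taken as the integer part of N/n and the random start is drawn from the first interval. The function name `systematic_sample` and the frame of 100 numbered units are hypothetical.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th element of the frame after a random start."""
    N = len(frame)
    k = N // n                 # sampling interval k = N / n (integer part)
    rng = random.Random(seed)
    start = rng.randrange(k)   # random starting point within the first interval
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 101))    # hypothetical frame of N = 100 unit IDs
sample = systematic_sample(frame, 10, seed=7)
```

Here k = 100 / 10 = 10, so the sample consists of ten units spaced exactly ten positions apart in the frame.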

Calculating a Sample Size

The three most important factors that determine sample size are:
- How accurate do you wish to be?
- How confident do you wish to be in the results?
- What budget do you have available?

The temptation is to say all should be as high as possible. The problem is that an increase in either accuracy or confidence (or both) will always require a larger sample and a higher budget. Therefore, a compromise must be reached.

Frequency distribution

The first result obtained from research is a series of raw data: a database in which the data for each item or object are entered in no particular order ("piled" data). To get an arranged statistical series (ordered array), we sort the data by order of magnitude, from the smallest observation to the largest. The easiest method of organizing data is a frequency distribution, which converts raw data into a meaningful pattern for statistical analysis.

The final form of data grouping is the statistical distribution of frequencies, in which each modality or class interval of the variable (there are n modalities or intervals) is associated with its absolute frequency $f_i$, that is, the number of times the modality or class occurs. The distribution is written as the pairs $(x_i, f_i)$ for modalities, or as $(L_{1,i}, L_{2,i}, f_i)$ for class intervals with lower endpoint $L_{1,i}$ and upper endpoint $L_{2,i}$. The number of class groupings used depends on the number of observations in the data (N). In general, the frequency distribution should have at least 5 class groupings but no more than 15.

When a variable can take continuous values instead of discrete values, or when the number of possible values is too large, constructing the table is cumbersome, if not impossible. In such cases a slightly different tabulation scheme is used, based on ranges of values (classes or intervals): $(L_{1,i}, L_{2,i}, f_i)$. Frequency distribution tables can be used for both categorical and numeric variables. Continuous variables should only be used with class intervals.

The relative frequency is the proportion of units of a statistical set that share the same modality or interval. The relative frequency of a particular modality or class interval is found by dividing its absolute frequency by the number of observations:

$$p_i = \frac{f_i}{N}, \qquad \sum_{i=1}^{n} p_i = 1.$$

The percentage frequency is found by multiplying each relative frequency by 100. It is expressed in percent and has the same meaning as the relative frequency:

$$P_i = \frac{f_i}{N} \cdot 100 = p_i \cdot 100, \qquad \sum_{i=1}^{n} P_i = 100.$$

Cumulative frequency (CF) is used to determine the number of observations that lie below (or above) a particular value in a data set. The increasing cumulative frequency tells how many observations have a value equal to or lower than the current modality. It is calculated from a frequency distribution table by adding each frequency to the sum of its predecessors:

$$S_i = \sum_{j=1}^{i} f_j.$$

The last value is always equal to the total number of observations, since by then all frequencies have been added.

Cumulative percentage (CF%) is used to determine the percentage of observations that lie below (or above) a particular value in a data set. The increasing cumulative percentage tells what percentage of observations have a value equal to or lower than the current modality. It is calculated by adding each percentage frequency from a frequency distribution table to the sum of its predecessors:

$$F_i = \sum_{j=1}^{i} P_j.$$
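The four quantities defined above (relative, percentage, cumulative and cumulative percentage frequency) can be computed together in a short Python sketch. The helper `frequency_table` and the ten data values are hypothetical and are used only to illustrate the formulas.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(values):
    """Return rows (x_i, f_i, p_i, P_i, S_i, F_i) for ungrouped data."""
    counts = Counter(values)
    modalities = sorted(counts)
    f = [counts[x] for x in modalities]   # absolute frequencies f_i
    N = sum(f)                            # number of observations
    p = [fi / N for fi in f]              # relative frequencies p_i = f_i / N
    P = [100 * pi for pi in p]            # percentage frequencies P_i = 100 * p_i
    S = list(accumulate(f))               # cumulative frequencies S_i
    F = list(accumulate(P))               # cumulative percentages F_i
    return list(zip(modalities, f, p, P, S, F))

data = [3, 4, 4, 5, 5, 5, 6, 6, 6, 6]     # hypothetical observations
table = frequency_table(data)
```

The last row of the table has cumulative frequency N and cumulative percentage 100 (up to floating-point rounding), as the definitions require.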

Excel solution for creating a frequency distribution:

1. For qualitative data:
   o Create a column with the modalities.
   o In the next column, in the cell beside the first modality, choose the Excel function Statistical - COUNTIF:
     - Range: the row, column or array with the original data (fix that range with $),
     - Criteria: the description of the modality.
   o For the other modalities, repeat this with the Copy option.
2. For numerical data:
   o Create new columns, one with the lower and one with the upper endpoints of the classes.
   o Select free cells beside those columns.
   o Choose the Excel function Statistical - FREQUENCY:
     - Data array: the row, column or array with the original data,
     - Bins array: the new column with the upper endpoints of the classes,
     - press CTRL+SHIFT+ENTER.
   o This produces the absolute frequencies for all classes.
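For readers who want to check the spreadsheet results outside Excel, the two functions can be approximated in Python. This is a simplified sketch, not Excel's implementation: `countif` here supports only exact matching of a single value, and `frequency` mimics the counting rule of FREQUENCY, where each bin counts the values greater than the previous upper endpoint and less than or equal to its own, with one extra count for values above the last bin. The sample data are hypothetical.

```python
from bisect import bisect_left

def countif(data, criterion):
    """Count the observations equal to the given modality (exact match only)."""
    return sum(1 for value in data if value == criterion)

def frequency(data, bins):
    """Count values per class, where bins holds the upper endpoints of the classes.

    A value v falls in the first bin whose upper endpoint is >= v.
    The returned list has one extra element for values above the last bin.
    """
    edges = sorted(bins)
    counts = [0] * (len(edges) + 1)
    for value in data:
        counts[bisect_left(edges, value)] += 1
    return counts

statuses = ["married", "divorced", "married", "unmarried"]  # hypothetical data
married = countif(statuses, "married")
per_class = frequency([1, 2, 2, 3, 5, 10], [2, 4])          # hypothetical data
```

In the second call the values 1, 2 and 2 fall in the class up to 2, the value 3 in the class up to 4, and the values 5 and 10 in the overflow class.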

Example 1.

According to the database for HBS 2004, we have information about several variables for 7,413 households:

- Entity
- Canton
- Gender
- Marital status
- Education level
- Employment status

These are qualitative variables with a small number of modalities, so we use non-interval grouping, that is, we find the absolute frequency of each modality.

First, in an empty column of the Excel sheet, we type the modalities of the given variable. We take the variable marital status, whose modalities are: unmarried, married, informal marriage, divorced and widower/widow.

To construct the frequency distribution we use the Excel function COUNTIF and give it the following arguments:

- Range: the row or column with the original data (we fix that data range with $: $D$2:$D$7414),
- Criteria: the cell with the given modality (H10).

With the Copy-Paste option we complete the other cells for the absolute frequencies. In the same way we can construct the frequency distribution for the other variables.

The next step is to calculate the relative and percentage frequencies from the absolute frequencies:

1. We get the relative frequency by dividing the absolute frequency by the sample or population size (N), obtained as the sum of the absolute frequencies (when we enter the sum we always fix the series with $). The other relative frequencies are obtained with the Copy-Paste option, and the sum of the relative frequencies has to equal 1.

2. We get the percentage frequency by multiplying the relative frequency by 100, which transforms the proportion into a percentage. The other percentage frequencies are obtained with the Copy-Paste option, and the sum of the percentage frequencies has to equal 100.

Interpretation: The largest share of households (71.24%) have a head who is formally married, while the smallest share (0.27%) have a head in an informal marriage.

For a nominal qualitative variable such as this one, it makes no sense to calculate cumulative frequencies, because they have no logical interpretation.

Example 2.

We have a database about imports and exports in the year 2007 for 181 countries (Doing Business 2007, trading across boundaries). The variable "number of documents for export" is an example of a discrete quantitative variable. To construct the frequency distribution we use the FREQUENCY function.

First we find the minimal and maximal value of the variable with the statistical functions MIN and MAX.

The minimal value is 3 and the maximal value is 14, so we enter the modalities from 3 to 14 in a new column (I8:I19) for the frequency distribution.

Then we select all cells where we need the absolute frequencies (J8:J19) and choose, under Functions, Statistical functions - FREQUENCY, with:

1. Data array: the original data (B2:B182)
2. Bins array: the modalities (I8:I19)

Then we press CTRL+SHIFT+ENTER at the same time and get the frequency distribution.

From the sum of the absolute frequencies (175) we can see that data on this variable are missing for 6 of the 181 countries.

The next step is to calculate relative, percentage and cumulative frequencies from the absolute frequencies:

1. We get the relative frequency by dividing the absolute frequency by the sample or population size (N), which is the sum of absolute frequencies (when we refer to the sum we always fix the cell with $):

We get the other relative frequencies with the Copy-Paste option, and the sum of relative frequencies has to be equal to 1:

2. We get the percentage frequency by multiplying the relative frequency by 100%:

Interpretation: The largest share of countries (19,43%) ask for 6 documents for export, while the smallest share of countries (1,14%) ask for 13 or 14 documents for export.

3. Increasing cumulative frequency
The first increasing cumulative frequency is always the same as the first absolute frequency; then we add the next absolute frequency to the current cumulative total:

We get the other cumulative frequencies with the Copy-Paste option, and the last cumulative frequency has to be equal to N:

Interpretation: 149 countries ask for 9 or fewer documents for export.

4. Increasing cumulative percentage frequency
The first increasing cumulative percentage frequency is always the same as the first percentage frequency; then we add the next percentage frequency to the current cumulative total:

We get the other cumulative percentage frequencies with the Copy-Paste option, and the last cumulative percentage frequency has to be equal to 100:

Interpretation: 61,14% of countries ask for 7 or fewer documents for export.
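The same table can be sketched outside Excel. The short Python example below is only an illustration: the function name `frequency_table` and the ten sample values are hypothetical and are not the Doing Business data.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(values):
    """Return rows (modality, f, relative, percent, cum_f, cum_percent)."""
    counts = Counter(values)
    modalities = sorted(counts)
    n = sum(counts.values())                    # N = sum of absolute frequencies
    abs_freq = [counts[m] for m in modalities]
    rel_freq = [f / n for f in abs_freq]        # absolute frequency / N
    pct_freq = [100 * p for p in rel_freq]      # relative frequency * 100%
    cum_freq = list(accumulate(abs_freq))       # running total of absolute frequencies
    cum_pct = [100 * c / n for c in cum_freq]   # running total in percent
    return list(zip(modalities, abs_freq, rel_freq, pct_freq, cum_freq, cum_pct))

# Hypothetical sample: number of export documents for 10 countries
documents = [3, 5, 6, 6, 6, 7, 7, 9, 9, 14]
table = frequency_table(documents)
for row in table:
    print("x=%2d  f=%d  p=%.2f  P=%5.1f%%  CF=%2d  CP=%5.1f%%" % row)
```

The last row can serve as a check: the cumulative frequency must equal N and the cumulative percentage must equal 100.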

Class intervals

Class interval width is the difference between the upper and lower endpoints of an interval ($l_i = L_{2,i} - L_{1,i}$).

In summary, follow these basic rules when constructing a frequency distribution table for a data set that contains a large number of observations:
• find the lowest and highest values of the variable,
• decide on the width of the class intervals and form class intervals that are mutually exclusive,
• include all possible values of the variable.

In an interval grouped series, in order to allow further calculations, we need to approximate each interval by its class middle (class mark, midpoint, centre of interval):

$c_i = \frac{L_{1,i} + L_{2,i}}{2}$.

Example 3.
We have a database on imports and exports in 2007 for 181 countries (Doing Business 2007, Trading Across Borders). The variable cost to import is an example of a continuous quantitative variable. To construct the frequency distribution we will use the FREQUENCY function.

First we find the minimum and maximum values of the variable with the statistical functions MIN and MAX:

The minimum value is 367 and the maximum value is 5520, and according to that we determine the intervals for the frequency distribution. We take intervals with width 500 and type the boundaries of those intervals in the next cells (trying to keep them visually symmetric):

Once the interval boundaries are set, we can use the Frequency function. We select all cells where we want to find absolute frequencies (K8:K19) and choose Functions: Statistical functions → Frequency, where:

1. Data array is the original data (G2:G182)
2. Bins array is the upper boundaries of the intervals, each of which is included in its interval (J8:J19)

We press CTRL+SHIFT+ENTER at the same time and get the frequency distribution:

The sum of absolute frequencies (175) shows that data on this variable are missing for 6 countries.

The frequency distribution looks like this:

The next step is to calculate relative, percentage and cumulative frequencies from the absolute frequencies:

1. We get the relative frequency by dividing the absolute frequency by the sample or population size (N), which is the sum of absolute frequencies (when we refer to the sum we always fix the cell with $):

The Copy-Paste option is used to get the other relative frequencies:

The sum of relative frequencies is 1.

2. We get the percentage frequency by multiplying the relative frequency by 100%:

Interpretation: The largest share of countries (32,57%) have a cost of import per container in the interval 1000-1500 US$, while the smallest share (0,57%) have a cost of import per container in the intervals 3000-3500 or 5500-6000. Because of that we can conclude that the data in the interval 5500-6000 are an outlier.

3. Increasing cumulative frequency

The first cumulative frequency is equal to the first absolute frequency, and then we cumulate:

Then we use the Copy-Paste option:

For example, one of the conclusions can be that 170 countries have a cost of import of at most 4000 US$.

4. Increasing cumulative percentage frequency

The procedure is the same as in the previous step, but with percentage frequencies:

Then we use the Copy-Paste option:

For example, one of the conclusions can be that 90,29% of countries have a cost of import of at most 2500 US$.
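The binning rule used above, where each upper boundary belongs to its own interval, can be sketched in Python as follows. This is a minimal illustration, not the Excel implementation: the function name `frequency` and the nine cost values are hypothetical, and only the interval boundaries follow the example.

```python
from bisect import bisect_left
from itertools import accumulate

def frequency(data, upper_bounds):
    """Count values per interval (lower, upper], upper boundary included.

    Returns len(upper_bounds) + 1 counts; the last count holds values
    above the highest boundary.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for x in data:
        # bisect_left returns the first boundary that is >= x
        counts[bisect_left(upper_bounds, x)] += 1
    return counts

# Hypothetical import costs (US$ per container), not the source data
costs = [367, 500, 820, 1000, 1001, 1450, 1499, 2300, 5520]
bounds = list(range(500, 6001, 500))   # 500, 1000, ..., 6000
f = frequency(costs, bounds)
cum = list(accumulate(f))              # increasing cumulative frequencies
print(f)
print(cum)
```

Note that a value of exactly 1000 falls in the interval 500-1000, not in 1000-1500, which is what "upper boundaries are included in the current interval" means.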

Outliers

An outlier is an extreme value of the data. It is an observation value that is significantly different from the rest of the data. There may be more than one outlier in a set of data.

Sometimes outliers are significant pieces of information and should not be ignored. Other times they occur because of an error or misinformation and should be ignored.

Data presentation: tables, diagrams and graphs

The two most important ways of presenting data are the previously presented tables with frequency distributions and graphs.

Why use graphs when presenting data? Because graphs:
• are quick and direct
• highlight the most important facts
• facilitate understanding of the data
• can convince readers
• can be easily remembered.

Knowing what type of graph to use with what type of information is crucial. Depending on the nature of the data and the variable type, some graphs might be more appropriate than others. You too can experiment with different types of graphs and select the most appropriate. There are several suggestions for an appropriate selection according to the effect that you want to achieve with the graph:

• pie chart (description of components)
• horizontal bar graph (comparison of items and relationships, time series)
• vertical bar graph (discrete variable, comparison of items and relationships, time series, frequency distribution)
• line graph (time series and frequency distribution)
• scatter plot (analysis of relationships)
• histogram (continuous variable).

In Excel, following Tools → Customize → Insert → Chart, we can find the Chart function and choose among different types of graphs:

Example 1.
We will again work with the variable marital status. What types of graphs can we use? Since this is a qualitative variable, we can construct a structural pie chart or vertical bars.

For this example we will construct a structural pie chart:

We choose the option Next:

We choose the option Next:
1. Titles - we give a title to the graph
2. Legend - we choose how the legend is shown
3. Data labels - we choose what to show on the pie: variable name, modality name, absolute frequency, %. We will show % because the modality names are already in the legend.

We choose the option Next and determine the place where the graph will be saved:

We choose the option Finish:

Example 2.
The variable number of documents for export is a discrete variable. Because of that we can choose a structural pie chart, vertical bars or a frequency polygon to represent it.

We will construct a graph with vertical bars:

We choose the option Next:

In the Series option we set the modalities (I8:I19) as values:

We choose the option Next:
a) Titles - we set the title for the graph and the axes
b) Axes - we set up the axes
c) Gridlines - we set up the gridlines
d) Legend - we choose whether and how to include a legend. If we have only one variable, the legend is not important; if we have several variables, we use the legend to distinguish them.
e) Data labels - we choose what to show on the graph: variable name, modality name, absolute frequency, %. We will show absolute frequencies:
f) Data table - if we include this option we get a table below the graph, but it repeats the information already shown on the graph.

We choose the option Next and determine the place where the graph will be saved:

We choose the option Finish:

Example 3.
Cost of import is a continuous variable. Because of that we prefer to use a histogram, a frequency polygon or a polygon of cumulative frequencies.

A. First we will construct a histogram. The procedure is the same as for vertical bars. At the end, when we have the graph with vertical bars, we format the gap between the bars to be equal to 0:

We click on the bars in the Excel graph and choose Format data series, and then Options, where we set Gap width equal to 0:

Click OK and we have the histogram (a graph with contiguous bars):

B. Now we will construct the polygon of absolute frequencies. For that we need the centres of the intervals, and therefore columns with the lower and upper boundaries of the intervals. The centre of an interval is the sum of its lower and upper boundaries divided by 2:

We get the other centres of intervals with the Copy-Paste function:

Now we can construct the frequency polygon:

We choose Next and in Data range select the cells with absolute frequencies:

For Series we select the centres of intervals as modalities:

Again we use the option Next:
a) Axes - we set up the axes
b) Gridlines - we set up the gridlines
c) Legend - we choose whether and how to include a legend. If we have only one variable, the legend is not important; if we have several variables, we use the legend to distinguish them.
d) Data labels - we choose what to show on the graph: variable name, modality name, absolute frequency, %. We will show absolute frequencies.
e) Data table - if we include this option we get a table below the graph, but it repeats the information already shown on the graph:

We choose the option Next and determine the place where the graph will be saved:

In the same way we can create the polygon of cumulative frequencies, but in that case we would select the cells with cumulative frequencies in Data range at the beginning.

[Figure: polygon of cumulative percentage frequencies. Horizontal axis: centre of interval (intervals 1 to 12); vertical axis: CF% (scale 0 to 120). Data labels in interval order: 3,4286; 34,8571; 67,4286; 82,8571; 90,2857; 95,4286; 96,0000; 97,1429; 98,2857; 99,4286; 99,4286; 100,0000.]

Descriptive statistics

Descriptive statistics are used to describe the basic features of the data in a study. Together with simple graphics analysis, they form the basis of virtually every quantitative and qualitative analysis of data.

There may be several objectives for formulating a summary statistic or parameter:
• To choose a statistic that shows how different units seem similar. Statistical textbooks call one solution to this objective a measure of central tendency.
• To choose another statistic that shows how they differ. This kind of statistic is often called a measure of statistical variability.
• To analyze the shape of the frequency distribution.



Measures of central tendency

Measures of central tendency summarize a list of numbers by a "typical" value called a measure of location. The three most common measures of location are the mean, the median, and the mode.

• The mean (average) is the sum of the values divided by the number of values. It has the smallest possible sum of squared differences from members of the list.

$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} x_i$

• The median is the middle value in the sorted list. It has the smallest possible sum of absolute differences from members of the list. In a frequency distribution it is located with the cumulative frequencies: the first modality or interval for which $\frac{N}{2} \le CF_{M_e}$ is the median or the interval in which the median is contained. If it is an interval, then the median is determined using the following formula:

$M_e = L_{1,M_e} + l_{M_e} \cdot \frac{\frac{N}{2} - CF_{M_e-1}}{f_{M_e}}$

where $L_{1,M_e}$ is the lower boundary, $l_{M_e}$ the width and $f_{M_e}$ the absolute frequency of the median interval, and $CF_{M_e-1}$ is the cumulative frequency of the preceding interval.

• The mode is the most frequent value in the list (or one of the most frequent values, if there is more than one). The mode is only calculated for a statistical distribution (grouped series). It can be determined graphically in a histogram. For a non-interval grouped distribution, the mode is read off as the modality with the highest frequency ($f_{max} = f_{M_o}$). For an interval grouped distribution, the mode within the interval with the highest frequency is determined using the following formula:

$M_o = L_{1,M_o} + l_{M_o} \cdot \frac{f_{M_o} - f_{M_o-1}}{(f_{M_o} - f_{M_o-1}) + (f_{M_o} - f_{M_o+1})}$

Sometimes we choose specific values from the cumulative distribution function called quartiles. The procedure is the same as for the median:

• 25% of the data have a value less than or equal to the first quartile and 75% of the data have a value higher than the first quartile (theoretical position $\frac{N}{4} \le CF_{Q_1}$)

• 75% of the data have a value less than or equal to the third quartile and 25% of the data have a value higher than the third quartile (theoretical position $\frac{3N}{4} \le CF_{Q_3}$).
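The interpolation formulas for the median, the quartiles and the mode of an interval grouped series can be sketched in Python. This is an illustrative sketch only: the function names and the small frequency distribution are hypothetical, and the mode function assumes a single interval with the highest frequency.

```python
def grouped_quantile(lower, width, freq, share):
    """Interpolated quantile of an interval grouped series.

    share = 0.5 gives the median, 0.25 and 0.75 give the quartiles.
    """
    target = share * sum(freq)          # N/2, N/4 or 3N/4
    cum_before = 0                      # CF of the preceding interval
    for L1, l, f in zip(lower, width, freq):
        if f > 0 and cum_before + f >= target:
            return L1 + l * (target - cum_before) / f
        cum_before += f
    raise ValueError("empty distribution")

def grouped_mode(lower, width, freq):
    """Interpolated mode inside the interval with the highest frequency."""
    k = freq.index(max(freq))
    f_prev = freq[k - 1] if k > 0 else 0
    f_next = freq[k + 1] if k + 1 < len(freq) else 0
    d1, d2 = freq[k] - f_prev, freq[k] - f_next
    return lower[k] + width[k] * d1 / (d1 + d2)

# Hypothetical distribution: intervals 0-10, 10-20, 20-30, 30-40
lower = [0, 10, 20, 30]
width = [10, 10, 10, 10]
freq = [2, 5, 8, 5]

median = grouped_quantile(lower, width, freq, 0.5)   # 23.75
q1 = grouped_quantile(lower, width, freq, 0.25)      # 16.0
q3 = grouped_quantile(lower, width, freq, 0.75)      # 30.0
mode = grouped_mode(lower, width, freq)              # 25.0
print(median, q1, q3, mode)
```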

Measures of dispersion

Dispersion refers to the spread of the values around the central tendency. There are three common absolute measures of dispersion:

• The range
The range is simply the highest value minus the lowest value: $RV = x_{max} - x_{min}$.

• The quartile range
The quartile range ($I_Q = Q_3 - Q_1$) is the range from the 25th to the 75th percentile of a distribution. It represents the "Middle Half" of the data and is a marker of variability or spread that is robust to outliers.

• The standard deviation
The standard deviation is the square root of the sum of the squared deviations from the mean divided by the number of scores (or the number of scores minus one, if we work with a sample).

For a population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{X})^2, \quad \sigma = \sqrt{\sigma^2}$

For a sample: $\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{X})^2, \quad \sigma = \sqrt{\sigma^2}$

The standard deviation allows us to reach some conclusions about specific scores in our distribution. Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the following conclusions can be reached (the six sigma rule):

• approximately 68% of the scores in the sample fall within one standard deviation of the mean

• approximately 95% of the scores in the sample fall within two standard deviations of the mean

• approximately 99.7% of the scores in the sample fall within three standard deviations of the mean.
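The difference between the population and the sample formula can be checked with Python's standard `statistics` module. The eight scores below are hypothetical; with such a small, non-normal sample the shares within one or two standard deviations need not match the 68% and 95% of the rule above.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]              # hypothetical scores
mean = statistics.mean(scores)                  # 5
sigma_population = statistics.pstdev(scores)    # divides by N
sigma_sample = statistics.stdev(scores)         # divides by N - 1

def share_within(data, centre, sigma, k):
    """Share of values that lie within k standard deviations of the centre."""
    inside = [x for x in data if abs(x - centre) <= k * sigma]
    return len(inside) / len(data)

print(sigma_population, sigma_sample)
print(share_within(scores, mean, sigma_population, 1))
print(share_within(scores, mean, sigma_population, 2))
```

The sample standard deviation is always the larger of the two, because the same sum of squared deviations is divided by N - 1 instead of N.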

A problem with the standard deviation, as an absolute measure of dispersion, is that we cannot use it to compare series with different units of measure or with different averages.

Besides that, we can define relative measures of dispersion such as:

• Coefficient of variation
The coefficient of variation is a relative measure of variability which can be used for comparing series with different units of measure, because it is a dimensionless number.

$V = \frac{\sigma}{\bar{X}} \cdot 100\ (\%)$

It can also be used for comparing series with different arithmetic means.

• z value
Z values determine the relative position of a variable modality in the series:

$z_i = \frac{x_i - \bar{X}}{\sigma}, \quad i = 1, 2, \ldots, N$

They are appropriate for comparing positions of data in different series. Z values are specific in that we can calculate a z value for each modality, not only for the series as a whole.

• The quartile deviation coefficient
The quartile deviation coefficient is a relative dispersion indicator and shows the variability around the median value:

$V_Q = \frac{I_Q}{Q_1 + Q_3} \cdot 100\% = \frac{Q_3 - Q_1}{Q_1 + Q_3} \cdot 100\%$

A higher value of the quartile deviation coefficient indicates greater dispersion of the data around the median, and vice versa.
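The three relative measures can be sketched in Python as follows. The function names and the scores are hypothetical; the quartiles 105 and 453 are the values quoted for the variable time (hours) in Example 4 of this chapter.

```python
import statistics

def coefficient_of_variation(data):
    """V = sigma / mean * 100 (population standard deviation)."""
    return 100 * statistics.pstdev(data) / statistics.mean(data)

def z_values(data):
    """z_i = (x_i - mean) / sigma for every value of the series."""
    mean = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return [(x - mean) / sigma for x in data]

def quartile_deviation_coefficient(q1, q3):
    """V_Q = (Q3 - Q1) / (Q1 + Q3) * 100%."""
    return 100 * (q3 - q1) / (q1 + q3)

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores
v = coefficient_of_variation(scores)
z = z_values(scores)
v_q = quartile_deviation_coefficient(105, 453)
print(round(v, 2), z, round(v_q, 2))
```

A z value of -1.5 for the first score means that it lies one and a half standard deviations below the mean of its own series.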

Shape of distribution

Symmetry or skewness

A frequency distribution may be symmetrical or asymmetrical. Imagine constructing a histogram centred on a piece of paper and folding the paper in half the long way. If the distribution is symmetrical, the part of the histogram on the left side of the fold would be the mirror image of the part on the right side of the fold. If the distribution is asymmetrical, the two sides will not be mirror images of each other. True symmetric distributions include what we will later call the normal distribution. Asymmetric distributions are more commonly found.

Measure of skewness: $\alpha_3 = \frac{\mu_3}{\sigma^3}, \quad \mu_3 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{X})^3$

$\alpha_3 = 0$: symmetry
$\alpha_3 > 0$: positively skewed
$\alpha_3 < 0$: negatively skewed

[Figure: frequency curves (f against X) for symmetric, left asymmetric and right asymmetric distributions.]

If a distribution is asymmetric it is either positively skewed or negatively skewed. A distribution is said to be positively skewed if the scores tend to cluster toward the lower end of the scale (that is, the smaller numbers) with increasingly fewer scores at the upper end of the scale (that is, the larger numbers). A negatively skewed distribution is exactly the opposite. With a negatively skewed distribution, most of the scores tend to occur toward the upper end of the scale while increasingly fewer scores occur toward the lower end.

Kurtosis

Another descriptive statistic that can be derived to describe a distribution is called kurtosis. It refers to the relative concentration of data in the centre, the upper and lower ends (tails), and the shoulders of a distribution. A distribution is platykurtic if it is flatter than the corresponding normal curve and leptokurtic if it is more peaked than the normal curve.

Modality

A distribution is called unimodal if there is only one major "peak" in the distribution of scores when represented as a histogram. A distribution is bimodal if there are two major peaks. If there are more than two major peaks, we call the distribution multimodal.

Measure of kurtosis: $\alpha_4 = \frac{\mu_4}{\sigma^4}, \quad \mu_4 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{X})^4$

$\alpha_4 = 3$: normal
$\alpha_4 > 3$: leptokurtic
$\alpha_4 < 3$: platykurtic
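The moment-based measures of skewness and kurtosis can be sketched with one small Python function. The name `standardized_moment` and both data sets are hypothetical. Note also that some software reports kurtosis as an excess over 3 and applies sample corrections, so values from this sketch need not match such output exactly.

```python
def standardized_moment(data, r):
    """alpha_r = mu_r / sigma**r, with central moments divided by N."""
    n = len(data)
    mean = sum(data) / n
    mu_2 = sum((x - mean) ** 2 for x in data) / n   # variance
    mu_r = sum((x - mean) ** r for x in data) / n   # r-th central moment
    return mu_r / mu_2 ** (r / 2)

symmetric = [1, 2, 3, 4, 5]        # hypothetical, mirror image around 3
right_skewed = [1, 1, 1, 2, 10]    # hypothetical, long tail to the right

alpha3_sym = standardized_moment(symmetric, 3)      # 0: symmetry
alpha4_sym = standardized_moment(symmetric, 4)      # below 3: platykurtic
alpha3_skew = standardized_moment(right_skewed, 3)  # above 0: positively skewed
print(alpha3_sym, alpha4_sym, alpha3_skew)
```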



Measure of concentration

The Lorenz curve is a graphical representation of the cumulative distribution function of a probability distribution; it is a graph showing the proportion of the distribution assumed by the bottom y% of the values. It is often used to represent income distribution, where it shows, for the bottom x% of households, what percentage y% of the total income they have.

Every point on the Lorenz curve represents a statement like "the bottom 20% of all households has 10% of the total income". A perfectly equal income distribution would be one in which every person has the same income. In this case, the bottom N% of society would always have N% of the income. This can be depicted by the straight line y = x, called the line of perfect equality.

By contrast, a perfectly unequal distribution would be one in which one person has all the income and everyone else has none. In that case, the curve would be at y = 0 for all x < 100%, and y = 100% when x = 100%. This curve is called the line of perfect inequality.

The Gini coefficient is the area between the line of perfect equality and the observed Lorenz curve, as a percentage of the area between the line of perfect equality and the line of perfect inequality. This equals two times the area between the line of perfect equality and the observed Lorenz curve.

$G = \frac{\text{concentration area}}{0{,}5} = 2 \cdot \text{concentration area}$

$G = 1 - \sum_{j} p_j (Q_j + Q_{j-1}), \quad 0 \le G \le 1$

where $p_j$ is the relative frequency and $Q_j$ is the cumulative relative aggregate of interval j.

The higher the Gini coefficient, the more unequal the distribution is.

Excel and SPSS do not offer an option to calculate measures of concentration directly, so we develop the procedure in Excel based on the formula.

Example 4.
We have a database of variables that describe the procedure of paying taxes for 181 countries (source: http://www.doingbusiness.org/CustomQuery/, data for 2008). The data are given in an Excel sheet (A1-G363). The variables are:
• Payments (number) (B2-B363)
• Time (hours) (C2-C363)
• Total tax rate (% profit) (D2-D363).
These are quantitative variables, so we can apply the methodology of descriptive statistics to the series of 181 data per variable to get several parameters which describe the given series.

The simplest and fastest way to get several parameters which describe the given series (x min, x max, average, deviation, mode, median, kurtosis and skewness) is to use the Excel function Tools → Data Analysis. If that option is not available we have to enable it:

1. Tools → Add-ins:

2. We have to tick Analysis ToolPak and Analysis ToolPak VBA:

3. Click OK and Data Analysis will appear in Tools:

Now we can use the Data Analysis option:

We get a list of analyses that we can run. Currently we are interested in the option Descriptive statistics, so we choose it and click OK. In Input range we can select all columns with the variables at the same time, grouped by columns ($B$1:$D$182). When selecting the data we also include the first cell with the variable name and tick the option Labels in first row. Then we choose an empty cell or a new sheet where the result of the analysis will be saved, and we select which parameters we want:

• Summary statistics - x min, x max, average, deviation, mode, median, kurtosis and skewness, range, count...

• Confidence level for mean - this is the boundary of the confidence interval for the average at the given confidence level (for example 95%)

• Kth largest and Kth smallest - if we want to calculate quantiles we choose this option; for example, for the first and third quartile we take 25 in both cases, and for the first and ninth decile we take 10 in both cases

Click OK and the result is:


Taking one of these variables, time (hours), as an example, we give interpretations of the results:
• The average is 317.63 hours for the sample of 181 countries (count), so on average 317.63 hours are needed for the paying taxes procedure.
• The standard error of the estimate of the average, obtained from the sample size and the standard deviation in the sample ($\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$), is 23.61 hours.
• The median is 256, so 50% of countries need 256 hours or less for the paying taxes procedure, while 50% of countries need more than 256 hours.
• The mode is 270, so the most frequently occurring value is 270 hours for the paying taxes procedure.
• The standard deviation, as the average deviation from the average, is 317.66 hours, so we can calculate the coefficient of variation:

$V = \frac{\sigma}{\bar{X}} \cdot 100 = \frac{317.66}{317.63} \cdot 100 \approx 100\%$

Relative variability around the average is 100%. This information is meaningful only in comparison with another series.
• The variance, as the average squared deviation from the average, is 100906.1, but we interpret it through the standard deviation.
• Kurtosis is (19.96+3)=22.96, which is more than 3, so we can conclude that this distribution is significantly more peaked than the normal curve.
• Skewness is 3.77, which is more than 0, so we can conclude that this distribution is significantly right asymmetric in comparison with the normal curve.
• The range, as the difference between the highest and lowest value, is 2600 h.
• The minimum time for the paying taxes procedure is 0 h.
• The maximum time for the paying taxes procedure is 2600 h.
• The sum of the data in the series is 57491, but there is no logical interpretation for this information.


• The third quartile is 453, so 75% of countries need 453 hours or less for the paying taxes procedure, while 25% of countries need more than 453 hours.

• The first quartile is 105, so 25% of countries need 105 hours or less for the paying taxes procedure, while 75% of countries need more than 105 hours.

• The boundary of the confidence interval for the average at the given confidence level of 95% is 46.59. The confidence interval for the average with a 95% confidence level is [317.63 ± 46.59] = [271.04, 364.22]. So, with a first type error of 5%, we can conclude that the average time for the paying taxes procedure lies in the interval [271.04, 364.22] hours.
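The standard error and the confidence interval quoted above can be re-derived from the reported summary values with a few lines of Python. This is only a cross-check: the variable names are hypothetical, and the margin 46.59 is taken from the Excel output rather than recomputed from a t table.

```python
import math

n = 181            # count
mean = 317.63      # average time (hours)
sigma = 317.66     # sample standard deviation (hours)
margin = 46.59     # "Confidence level (95%)" value reported by Excel

standard_error = sigma / math.sqrt(n)        # sigma / sqrt(n)
lower, upper = mean - margin, mean + margin  # confidence interval for the average
implied_multiplier = margin / standard_error # critical value implied by the output

print(round(standard_error, 2))              # 23.61
print(round(lower, 2), round(upper, 2))      # 271.04 364.22
print(round(implied_multiplier, 3))
```

The implied multiplier is slightly above 1.96, which is what one would expect if the margin is based on Student's t distribution with n - 1 degrees of freedom rather than on the normal distribution.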

To see these parameters visually we will construct a histogram. This option is available in Data Analysis:

Before constructing the histogram we have to define intervals according to the minimum and maximum values and the number of intervals that we want to create. The maximum value is 2600 and the minimum value is 0, so we define intervals with width 100: 0-100, 100-200, ..., 400-500, 500-600, ..., 2500-2600. The upper limits that are included in those intervals are: 99, 199, ..., 499, 599, ..., 2600. We type these limits in one Excel column (I22:I47).

For Input range we select the column with the original data (C2:C182) and for Bin Range we select the cells where we typed the upper limits of the intervals (I22:I47). We choose where to save the result and tick the option Chart output:

The graph that we get is a graph with vertical bars, but we click on the graph and go to Chart options → Options. There we set the gap between the bars equal to 0:

Finally the histogram looks like this:


[Figure: Histogram. Horizontal axis: Bin (upper limits 99, 299, 499, ..., 2499, More); vertical axis: Frequency (scale 0 to 60).]

Our interpretation of the parameters of the distribution shape is fully confirmed. The distribution is strongly positively (right) asymmetric and peaked. It differs significantly from the normal curve.

Example 5.
To analyse the concentration of consumption in the HBS 2008 database, we take data on consumption per capita for 23374 individuals from 7071 households:

These are original raw data, so we first construct an appropriate frequency distribution. We need to find the minimum and maximum values of the consumption level in our sample:

Based on that we decide to set up intervals with width 5000, so the upper limits included in the intervals (bins) are: 4999,99, 9999,99, 14999,99, ..., 54999,99. We type these limits in an empty column of the sheet with the original data:


We select the empty cells in the next column (E6:E16). In functions (fx) we choose Frequency:

With CTRL+SHIFT+ENTER we get the frequency distribution:

Now we can start constructing the Lorenz curve and calculating the Gini coefficient. We need the centres of intervals and the relative frequencies, but before that we have to form columns with the lower and upper limits of the intervals:

First we calculate the centres of intervals:

With the Copy-Paste option we get the column with centres of intervals:

Then we calculate the relative frequencies:

With the Copy-Paste option we get the column with relative frequencies:

Then we calculate the relative cumulative frequencies. The first is the same as the first relative frequency, and then we cumulate:

With the Copy-Paste option we get the column with relative cumulative frequencies:


Then we need the cumulative relative aggregate. First we calculate the aggregate (cp) as the product of the centre of interval and the absolute frequency of the given interval:

With the Copy-Paste option we get the column for the aggregate:

We calculate the relative aggregate as $q_i = \frac{c_i p_i}{\sum_i c_i p_i}$:

With the Copy-Paste option we get the column for the relative aggregate:

Finally we find the relative cumulative aggregate (Q):

With the Copy-Paste option we get the column for the cumulative relative aggregate:

To graph the Lorenz curve we take the relative cumulative frequencies for the x axis and the cumulative relative aggregate for the y axis. Before that we insert one point with value 0 for both cumulative series:


Now we can graph the Lorenz curve:

For the line of perfect equality we take the same data, the relative cumulative frequencies, for both axes.

For the Lorenz curve we take:

Now with Add we insert a new series for the line of perfect equality:

We choose Next and then get the option to give titles:

Finally the graph looks like this:

The white area is the area of concentration.

We calculate the Gini coefficient, which quantifies the measure of concentration, according to the relation $G = 1 - \sum_j p_j (Q_j + Q_{j-1})$:

With the Copy-Paste option we complete this column:

When we calculate (1 - this sum) we get the Gini coefficient:

And we get the result:

The Gini coefficient is 0.3378, so the distribution of consumption is not perfectly equal, but the level of concentration is not very high.
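The spreadsheet procedure of Example 5 (relative frequencies, aggregate, cumulative relative aggregate, Gini coefficient) can be sketched in Python. The function name `gini_grouped` and the three-interval distribution are hypothetical; they are not the HBS 2008 data and do not reproduce the value 0.3378.

```python
from itertools import accumulate

def gini_grouped(midpoints, freq):
    """Return (Lorenz curve points, Gini coefficient) for grouped data.

    G = 1 - sum_j p_j * (Q_j + Q_{j-1}), with Q_0 = 0.
    """
    n = sum(freq)
    p = [f / n for f in freq]                           # relative frequencies
    aggregate = [c * f for c, f in zip(midpoints, freq)]
    total = sum(aggregate)
    q = [a / total for a in aggregate]                  # relative aggregate
    cum_p = [0.0] + list(accumulate(p))                 # x axis of the Lorenz curve
    cum_q = [0.0] + list(accumulate(q))                 # y axis of the Lorenz curve
    g = 1 - sum(p_j * (cum_q[j + 1] + cum_q[j]) for j, p_j in enumerate(p))
    return list(zip(cum_p, cum_q)), g

# Hypothetical consumption classes: centres of intervals and absolute frequencies
midpoints = [2500, 7500, 12500]
freq = [50, 30, 20]
lorenz, gini = gini_grouped(midpoints, freq)
print(lorenz)
print(round(gini, 4))
```

The first Lorenz point is (0, 0), which corresponds to the extra point with value 0 inserted for both cumulative series before graphing.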

EMPIRICAL VERSUS APPROPRIATE THEORETICAL DISTRIBUTIONS
EXAMPLES IN EXCEL

II. Empirical versus appropriate theoretical distributions (approximations with binomial, Poisson, hypergeometric or normal distribution)

PROBABILITY DISTRIBUTIONS

A frequency distribution formed by grouping population units according to the same characteristics is an empirical distribution. A distribution formed on the basis of theoretical assumptions is a theoretical distribution. The main characteristics of theoretical distributions are:

• We assume them in some statistical model or we set them up as a hypothesis that we have to test.

• Theoretical distributions are given as an analytic model with known parameters: expectation, mode, median, standard deviation, skewness and kurtosis.

• Theoretical distributions are given as probability distributions.

A probability for which we know the number of possible outcomes of an event and the number of successful realizations is an a priori probability. In statistical research, however, we most often do not know the probability a priori, so we use an experiment to obtain the probability a posteriori. An a posteriori probability is therefore an empirical or statistical probability.

The empirical (a posteriori) probability is the limiting value of the relative frequency of successes of event A when the number of trials is large and tends to infinity:

$p(A) = \lim_{n \to \infty} \frac{m}{n}$; m - number of successful realizations, n - number of trials.
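The idea that the relative frequency m/n approaches the probability as the number of trials grows can be illustrated with a small simulation. This is a sketch under stated assumptions: the function name, the probability 0.4 and the fixed seed are all chosen only for illustration.

```python
import random

def relative_frequency(p, n, seed=0):
    """Simulate n independent trials with success probability p; return m / n."""
    rng = random.Random(seed)                        # fixed seed, repeatable result
    m = sum(rng.random() < p for _ in range(n))      # m = number of successes
    return m / n

for trials in (10, 1000, 200000):
    print(trials, relative_frequency(0.4, trials))
```

With few trials the relative frequency can be far from 0.4; with many trials it settles close to it, which is the sense in which the empirical probability is a limiting value.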

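The cumulative function F(x) of a discrete variable X gives the probability that X takes values lower than or equal to some real number $x_i$: $F(x_i) = P(X \le x_i) = \sum_{x \le x_i} p(x)$.

The cumulative function F(x) of a continuous variable X has the general form $F(a) = P(X \le a) = \int_{-\infty}^{a} f(x)\,dx$, and it is determined by parameters such as expectation and variance.

If a discrete variable X can take the values $x_1, x_2, \ldots, x_k$ with probabilities $p(x_1), p(x_2), \ldots, p(x_k)$, where the sum of probabilities has to be 1, the expectation of X is:

$E(X) = p(x_1)x_1 + p(x_2)x_2 + \ldots + p(x_k)x_k = \sum_{i=1}^{k} p(x_i)\,x_i$.

For a continuous variable the expectation is:

$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx, \quad -\infty < x < \infty$.

Variance for a discrete variable is:

$\sigma^2 = \sum_{i=1}^{k} p(x_i)\,(x_i - E(X))^2$, that is, $\sigma^2 = \sum_{i=1}^{k} x_i^2\,p(x_i) - \left(\sum_{i=1}^{k} x_i\,p(x_i)\right)^2$.

Variance for a continuous variable is:

$\sigma^2 = \int_{-\infty}^{\infty} (x - E(X))^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \left(\int_{-\infty}^{\infty} x f(x)\,dx\right)^2$.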
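For a discrete variable, the expectation and both forms of the variance can be checked numerically. The sketch below uses hypothetical function names and a hypothetical three-value distribution.

```python
import math

def expectation(values, probs):
    """E(X) = sum of p(x_i) * x_i; the probabilities must sum to 1."""
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in zip(values, probs))

def variance(values, probs):
    """sigma^2 = sum of p(x_i) * (x_i - E(X))^2."""
    e = expectation(values, probs)
    return sum(p * (x - e) ** 2 for x, p in zip(values, probs))

def variance_shortcut(values, probs):
    """sigma^2 = sum of x_i^2 * p(x_i) minus (sum of x_i * p(x_i))^2."""
    e = expectation(values, probs)
    return sum(p * x ** 2 for x, p in zip(values, probs)) - e ** 2

values = [0, 1, 2]              # hypothetical modalities
probs = [0.25, 0.5, 0.25]       # hypothetical probabilities
print(expectation(values, probs), variance(values, probs), variance_shortcut(values, probs))
```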

Theoretical probability distributions can be split into 2 groups:

• discrete probability distributions deal with discrete events
  o binomial distribution
  o Poisson distribution
  o hypergeometric distribution
• continuous probability distributions deal with continuous events
  o normal distribution
  o Student (t) distribution
  o $\chi^2$ (chi-square) distribution
  o F distribution.

The probability distribution of a random variable describes the probability of all possible outcomes. The sum (integral) of these probabilities equals 1.

BINOMIAL DISTRIBUTION

The binomial distribution is used when the discrete random variable of interest is the number of successes obtained in a sample of n observations. It is used to model situations that have the following properties:
• The sample consists of a fixed number of observations n.
• Each observation is classified into one of two mutually exclusive categories, usually called success and failure.
• The probability of an observation being classified as success, noted as p, is constant from observation to observation. Thus, the probability of an observation being classified as failure, noted as (1-p)=q, is constant over all observations.
• The outcome (success or failure) of any observation is independent of the outcome of any other observation.

Thus the binomial distribution has two parameters:
• n - number of observations, trials or experiment repetitions.
• p - the probability of success (occurrence of a given event) on a single observation, trial or experiment.

Probability distribution of a binomial random variable

The probability distribution of a binomial random variable is:

$p(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, \ldots, n$,

where x is the exact number of successes of interest and $p(x)$ is the probability that exactly x successes will be realized among n trials (the given event will be realized exactly x times).

Binomial probability function 1

1 From Wikipedia, the free encyclopedia

Example 1.
An insurance broker believes that for a particular contact the probability of making a sale is 0.4. Suppose now that he has five contacts. What is the probability that he will realize three sales among these five contacts?

Solution:

The discrete random variable X takes the value 1 if a sale is made and 0 if a sale is not made, so this is a dichotomous variable that can be treated with the binomial distribution. The sale experiment is repeated 5 times, so n=5. Since the variable is dichotomous, we apply the binomial distribution:

$p = p(1) = 0.4, \quad q = p(0) = 1 - 0.4 = 0.6, \quad n = 5, \quad x = 3$

$p(x) = \binom{n}{x} p^x (1-p)^{n-x} \;\Rightarrow\; p(3) = \binom{5}{3} \cdot 0.4^3 \cdot 0.6^2 = 0.23$

The probability that he will realize three sales among these five contacts is 23%.
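The result of Example 1 can be cross-checked with a direct implementation of the binomial probability function. The function name `binomial_pmf` is hypothetical; `math.comb` requires Python 3.8 or later.

```python
from math import comb

def binomial_pmf(x, n, p):
    """p(x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Example 1: five contacts, probability of a sale 0.4, exactly three sales
p_three_sales = binomial_pmf(3, 5, 0.4)
print(round(p_three_sales, 4))

# The probabilities of all possible outcomes 0..5 sum to 1
total = sum(binomial_pmf(x, 5, 0.4) for x in range(6))
print(round(total, 10))
```

The unrounded value is 10 · 0.064 · 0.36 = 0.2304, which the text reports as 23%.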

Characteristics of the Binomial distribution

• Shape
The binomial distribution can be symmetrical (if p=0.5) or skewed (if p ≠ 0.5)
• Mean
$\mu = E(X) = n \cdot p$
• Variance
$\sigma^2 = E\left[(X - \mu)^2\right] = n \cdot p \cdot (1-p)$

We have 4 types of binomial distribution:
• symmetric; if p=q=0.5
• asymmetric; if p ≠ q
• a priori; if we know the probabilities p and q
• a posteriori; if we have to find p and q empirically

Conditions for approximating an empirical distribution with the binomial distribution are:

$0 < \frac{\bar{X}}{n} < 1$

$\sigma^2 \approx \bar{X}\left(1 - \frac{\bar{X}}{n}\right)$

The error of approximation is a measure of the quality of the approximation. The error of approximation for each modality is $d_k = f_k - f_k^b$, where $f_k$ is the empirical frequency and $f_k^b$ is the theoretical frequency, so the overall error of approximation is:

$\sigma_b^2 = \frac{1}{n}\sum_{k} d_k^2$

dn

Example 2.
The accounting office of a company has information that 40% of customers do not meet their obligations on time because of inflation. If we randomly select 6 customers, what is the probability:

1. that all customers met their obligation on time
2. that more than 3/4 of customers met their obligation on time
3. that 50% or more of customers did not meet their obligation on time.


71

Solution:

p = 60% = 0.6 (obligation met on time)
q = 40% = 0.4 (obligation not met on time)
n = 6

$$p(x) = \binom{n}{x} p^x q^{n-x}$$

1. The probability that all customers met their obligations on time is, according to the table, p(6) = 4.67%.

2. The probability that more than 3/4 of the customers met their obligations on time: 3/4 of 6 is 4.5, so we take the probabilities for x = 5 and x = 6. According to the table, p(5) = 18.66% and p(6) = 4.67%, so by the addition theorem the final result is 23.33%.

3. The probability that 50% or more of the customers did not meet their obligations on time: 50% of 6 is 3, so at least 3 customers are late, which means that at most 3 customers pay on time. Since x counts the customers who pay on time, we take the probabilities for x = 0, 1, 2, 3. According to the table this is 0.004096 + 0.036864 + 0.138240 + 0.276480 = 0.45568 ≈ 45.57%.
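The three answers of Example 2 can be reproduced from the binomial formula without a table. The sketch below is illustrative only; the variable names are assumptions, and x counts the customers who pay on time (p = 0.6), so "50% or more are late" is read as "at most 3 pay on time".

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability of exactly x successes in n trials (illustrative helper)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 6, 0.6  # x = number of customers who pay on time
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

all_on_time = pmf[6]                     # question 1
more_than_three_quarters = pmf[5] + pmf[6]  # question 2: x > 4.5, i.e. x = 5 or 6
half_or_more_late = sum(pmf[:4])         # question 3: at most 3 pay on time

print(round(all_on_time, 6), round(more_than_three_quarters, 5), round(half_or_more_late, 5))
```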

Example 3.
Among 1000 products, 28 are defective. If we randomly select 14 products as a sample, what is the probability that:
a) the sample contains exactly 4 defective products;
b) the sample contains at most 2 defective products;
c) the sample contains at least 4 defective products?

Solution (by Excel):
This is a dichotomous variable, so we apply the binomial distribution with modalities x: 0, 1, 2, 3, 4, ..., 14.

$$p = \frac{28}{1000} = 0.028, \qquad q = 0.972, \qquad n = 14$$

$$p_k^b = P(X = k) = \binom{14}{k}\, 0.028^k \cdot 0.972^{14-k}, \qquad k = 0, 1, \dots, 14$$
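As a cross-check outside Excel, the three probabilities asked for in Example 3 can be evaluated directly from the binomial formula. This is a sketch under the assumptions of the example (n = 14, p = 0.028); the helper and variable names are illustrative.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability of exactly x successes in n trials (illustrative helper)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 14, 28 / 1000

exactly_four = binomial_pmf(4, n, p)                            # a) point probability
at_most_two = sum(binomial_pmf(x, n, p) for x in range(3))      # b) cumulative, x = 0, 1, 2
at_least_four = 1 - sum(binomial_pmf(x, n, p) for x in range(4))  # c) complement of x <= 3

print(f"{exactly_four:.6f} {at_most_two:.6f} {at_least_four:.6f}")
```

Part c) uses the complement rule: "at least 4" is the opposite event of "0, 1, 2 or 3 defective products".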

We will use Excel function:

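Table for Example 2 (binomial probabilities p(x) and cumulative probabilities F(x) for n = 6, p = 0.6):

x_i    p(x)        F(x)
0      0.004096    0.004096
1      0.036864    0.040960
2      0.138240    0.179200
3      0.276480    0.455680
4      0.311040    0.766720
5      0.186624    0.953344
6      0.046656    1.000000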


a) The sample contains exactly 4 defective products.
We need the probability at a single point, not the cumulative function, so we set the Cumulative option to FALSE.

=BINOMDIST(4;14;0.028;FALSE) = 0.000463 ≈ 0.0463%

b) The sample contains at most 2 defective products (0, 1 or 2 defective products). This is a cumulative probability, so we set the Cumulative option to TRUE.

=BINOMDIST(2;14;0.028;TRUE) = 0.993662 ≈ 99.3662%

c) The sample contains at least 4 defective products (4, 5 or more). This is the opposite event of the cumulative event "at most 3 defective products" (0, 1, 2 or 3 defective products). The probabilities of an event and its opposite event sum to 1, so we use Excel to get the probability of the opposite event and then subtract it from 1:

=1-BINOMDIST(3;14;0.028;TRUE) = 1 - 0.999509 = 0.000491 ≈ 0.0491%

Example 4.
To monitor the work of an automatic machine, an inspector takes samples of 10 products. On the basis of 50 samples we have the following information about the number of defective products:

Number of defective products    Number of samples
0                               6
1                               11
2                               15
3                               10
4                               7
5                               1
Total                           50

We have to find an appropriate theoretical approximation for this empirical distribution.

Solution:

This is a discrete random variable. Each trial has two modalities: a product can be correct or defective. This indicates that the appropriate theoretical distribution is the binomial distribution. From the empirical frequency distribution we calculate the average and the standard deviation. We cannot use the Excel functions directly, because this is a grouped distribution, so we set up formulas to calculate the average and the standard deviation:

n = 10, N = 50

Result is:


Alternatively, we create a new column ($x_k f_k$) and divide the sum of that column by the sum of the absolute frequencies:

x_k    f_k    x_k f_k
0      6      0
1      11     11
2      15     30
3      10     30
4      7      28
5      1      5
Sum    50     104

$$\bar{X} = \frac{\sum x_k f_k}{N} = \frac{104}{50} = 2.08$$

Then we will calculate standard deviation:

Result is:


Alternatively, we create a new column $x_k^2 f_k$ and calculate with the general formula $\sigma^2 = \frac{\sum x_k^2 f_k}{N} - \bar{X}^2$:

x_k    f_k    x_k f_k    x_k^2 f_k
0      6      0          0
1      11     11         11
2      15     30         60
3      10     30         90
4      7      28         112
5      1      5          25
Sum    50     104        298

$$\sigma^2 = \frac{\sum x_k^2 f_k}{N} - \bar{X}^2 = \frac{298}{50} - 2.08^2 = 1.63, \qquad \sigma = 1.278$$

Now we will test whether the conditions for the binomial approximation are satisfied:

$$\bar{X}\left(1 - \frac{\bar{X}}{n}\right) = 2.08\left(1 - \frac{2.08}{10}\right) = 1.65 \approx \sigma^2 = 1.63$$

$$0 < \frac{\bar{X}}{n} = 0.208 < 1$$

The conditions are satisfied, so we can apply the approximation. Then $p = \bar{X}/n = 0.208$ and $q = 0.792$.

$$p_x^b = \binom{10}{x}\, 0.208^x \cdot 0.792^{10-x}, \quad x = 0, \dots, 5; \qquad f_x^b = p_x^b \cdot N$$

In Excel we create a formula for calculating the probabilities $p_x^b$ and then, from these theoretical probabilities, compute the theoretical frequencies $f_x^b = p_x^b \cdot N$:


With the Paste option we can fill in the other cells in the column of theoretical probabilities. The result is:

Now we will compute the theoretical frequencies:

With the Paste option we can fill in the other cells in the column of theoretical frequencies. The result is:

That was the procedure for approximation with the binomial distribution. We now have the distribution of this variable and can make predictions. The quality of the approximation is measured by the error of approximation.

The error of approximation for each modality is: $d_k = f_k - f_k^b$

Because the errors have different signs, we square them:

We then sum the squared errors:

$$\sigma_b^2 = \frac{1}{n+1} \sum d_k^2 = \frac{9.589}{11} = 0.872$$

The error of approximation is 0.872.
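The whole fitting procedure of Example 4 (mean, variance, estimate of p, theoretical frequencies and error of approximation) can be sketched in Python as follows. The variable names are assumptions chosen for illustration; the data and the formulas are those of the example.

```python
from math import comb

x_values = [0, 1, 2, 3, 4, 5]      # number of defective products in a sample
freqs = [6, 11, 15, 10, 7, 1]      # number of samples with that count
n = 10                             # products per sample
N = sum(freqs)                     # number of samples (50)

mean = sum(x * f for x, f in zip(x_values, freqs)) / N
variance = sum(x * x * f for x, f in zip(x_values, freqs)) / N - mean**2
p = mean / n                       # estimated probability of a defect

# Theoretical (binomial) frequencies f_x^b = p_x^b * N
theoretical = [N * comb(n, x) * p**x * (1 - p) ** (n - x) for x in x_values]

# Error of approximation: squared differences summed and divided by n + 1
sum_sq = sum((f - fb) ** 2 for f, fb in zip(freqs, theoretical))
error = sum_sq / (n + 1)

print(round(mean, 2), round(variance, 2), round(p, 3))
print([round(fb, 2) for fb in theoretical])
print(round(sum_sq, 3), round(error, 3))
```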

POISSON DISTRIBUTION
The Poisson distribution is a useful discrete probability distribution when you are interested in the number of times a certain event will occur in a given unit of area or time. This type of situation occurs frequently in business. [...] of opportunity approaches zero as the area of opportunity becomes smaller. The Poisson distribution has one parameter, λ > 0, which is the average or expected number of events per unit.

Probability distribution of Poisson random variable

The probability distribution of a Poisson random variable is:

$$p(x) = \frac{e^{-\lambda} \lambda^x}{x!}$$

where:
- x is the number of events per unit (number of successes per unit)
- p(x) is the probability of x successes, given knowledge of λ
- λ is the average number of events per unit (average number of successes per unit)
- e = 2.71828 (constant)

Poisson probability function 2

The horizontal axis is the index k. The function is only defined at integer values of k (empty lozenges). The connecting lines are only guides for the eye.

Example 5.
If the probability that an individual is late for work on Friday is 0.001, determine the probability that, out of 2000 individuals,
a) exactly 3
b) more than 2
individuals will be late for work on Friday.

Solution:

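p = 0.001 is the probability that an individual is late for work on Friday (rare event ⇒ Poisson distribution).

$$\lambda = N \cdot p = 2000 \cdot 0.001 = 2 \;\Rightarrow\; p(x) = \frac{e^{-\lambda} \lambda^x}{x!} = \frac{e^{-2}\, 2^x}{x!}$$

a)
$$p(3) = \frac{e^{-2}\, 2^3}{3!} = 0.18$$

There is an 18% chance that, out of 2000 individuals, exactly 3 will be late for work on Friday.

b)
$$p(x > 2) = p(3) + p(4) + \dots = 1 - \left[p(0) + p(1) + p(2)\right] = 1 - \left[\frac{e^{-2}\, 2^0}{0!} + \frac{e^{-2}\, 2^1}{1!} + \frac{e^{-2}\, 2^2}{2!}\right] = 0.323$$

There is a 32.3% chance that, out of 2000 individuals, more than 2 will be late for work on Friday.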
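The two results of Example 5 can be verified with a short Python sketch. The helper name `poisson_pmf` is an assumption used for illustration only.

```python
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """Poisson probability of exactly x events when the mean is lam (illustrative helper)."""
    return exp(-lam) * lam**x / factorial(x)

lam = 2000 * 0.001  # lambda = N * p = 2

exactly_three = poisson_pmf(3, lam)                              # part a)
more_than_two = 1 - sum(poisson_pmf(x, lam) for x in range(3))   # part b): 1 - [p(0) + p(1) + p(2)]

print(round(exactly_three, 4), round(more_than_two, 4))
```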

Example 6.
Suppose that, on average, three customers arrive per minute at the bank during the noon to 1 p.m. hour. What is the probability that in a given minute exactly two customers will arrive?

Solution:

We are interested in the number of times a certain event will occur in a given unit of time ⇒ Poisson distribution.

$$\lambda = 3 \;\Rightarrow\; p(x) = \frac{e^{-\lambda} \lambda^x}{x!} = \frac{e^{-3}\, 3^x}{x!}$$

$$p(2) = \frac{e^{-3}\, 3^2}{2!} = 0.224$$

There is a 22.4% probability that in a given minute exactly two customers will arrive.

Example 7.
If the probability that a randomly selected person is colour-blind (a daltonist) is 0.3%, what is the probability that among 2800 persons we will find:
a) 4 colour-blind persons
b) more than 3 colour-blind persons
c) not more than 2 colour-blind persons?

Solution (by Excel):
p = 0.003 = 0.3% (rare event ⇒ Poisson distribution)

$$\lambda = n \cdot p = 2800 \cdot 0.003 = 8.4 \;\Rightarrow\; p(x) = \frac{e^{-\lambda} \lambda^x}{x!} = \frac{e^{-8.4}\, 8.4^x}{x!}$$

We will use Excel function:


a) Exactly 4 colour-blind persons (daltonists).
We need the probability at a single point, not the cumulative function, so we set the Cumulative option to FALSE.

P(X = 4) = POISSON(4;8.4;FALSE) = 0.046648 ≈ 4.6648%

b) More than 3 colour-blind persons. This is the opposite of a cumulative probability, so we set the Cumulative option to TRUE and at the end take the probability of the opposite event:

1 - P(X ≤ 3) = 1 - POISSON(3;8.4;TRUE) = 1 - 0.03226 = 0.96774 ≈ 96.774%

c) Not more than 2 colour-blind persons. This is a cumulative probability, so we set the Cumulative option to TRUE.

P(X ≤ 2) = POISSON(2;8.4;TRUE) = 0.010047 ≈ 1.0047%
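The Excel calls of Example 7 have simple Python counterparts. The sketch below is illustrative: `poisson_pmf` plays the role of POISSON(...;FALSE) and `poisson_cdf` the role of POISSON(...;TRUE); both names are assumptions, not functions from the text.

```python
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """Point probability P(X = x), like POISSON(x; lam; FALSE)."""
    return exp(-lam) * lam**x / factorial(x)

def poisson_cdf(x: int, lam: float) -> float:
    """Cumulative probability P(X <= x), like POISSON(x; lam; TRUE)."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

lam = 2800 * 0.003  # lambda = n * p = 8.4

exactly_four = poisson_pmf(4, lam)          # a)
more_than_three = 1 - poisson_cdf(3, lam)   # b) complement of P(X <= 3)
at_most_two = poisson_cdf(2, lam)           # c)

print(f"{exactly_four:.6f} {more_than_three:.5f} {at_most_two:.6f}")
```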

Characteristics of the Poisson distribution

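Shape
The Poisson distribution is always positively (right) skewed.

Mean
$$\mu = E(X) = \lambda$$

Variance
$$\sigma^2 = E\left[(X - \mu)^2\right] = \lambda$$

$$\alpha_3 = \frac{1}{\sqrt{\lambda}}, \qquad \alpha_4 = 3 + \frac{1}{\lambda}$$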

The Poisson distribution can be derived as a limiting case of the binomial distribution as the number of trials goes to infinity and the expected number of successes remains fixed. Therefore it can be used as an approximation of the binomial distribution if n is sufficiently large and p is sufficiently small. There is a rule of thumb stating that the Poisson distribution is a good approximation of the binomial distribution if n is at least 20 and p is smaller than or equal to 0.05. According to this rule the approximation is excellent if n ≥ 100 and np ≤ 10.
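How close the two distributions are can be seen numerically. The following sketch compares binomial and Poisson probabilities for n = 2000 and p = 0.001 (the values of Example 5, which satisfy the rule of thumb); the helper names are illustrative assumptions.

```python
from math import comb, exp, factorial

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Binomial probability of exactly x successes in n trials (illustrative helper)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x: int, lam: float) -> float:
    """Poisson probability of exactly x events with mean lam (illustrative helper)."""
    return exp(-lam) * lam**x / factorial(x)

n, p = 2000, 0.001
lam = n * p  # expected number of successes stays fixed at 2

rows = [(x, binomial_pmf(x, n, p), poisson_pmf(x, lam)) for x in range(6)]
max_diff = max(abs(b - q) for _, b, q in rows)

for x, b, q in rows:
    print(x, round(b, 5), round(q, 5))
print("largest difference:", max_diff)
```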

Example 8.
An office has a copy machine. We want to determine the average number of incorrect copies. We take samples of 1000 copies; the number of trials was 250 and the results are:

Number of incorrect copies    Number of samples
0                             10
1                             20
2                             40
3                             55
4                             50
5                             40
6                             15
7                             10
8                             5
9                             3
10                            2
Total                         250

We have to find an appropriate theoretical approximation for this empirical distribution.

Solution:

This is a discrete random variable. Each trial has two modalities: a copy can be correct or incorrect. This indicates that the appropriate theoretical distribution is the binomial or the Poisson distribution. From the empirical frequency distribution we calculate the average and the standard deviation. We cannot use the Excel functions directly, because this is a grouped distribution, so we set up formulas to calculate the average and the standard deviation:

n = 100, N = 250



Result for average is:

We will find variance:



Result for variance is:

Here $\sigma^2 \approx \bar{X}$, which indicates a Poisson distribution with $\lambda = \bar{X} = 3.65$:

$$p_x^p = \frac{3.65^x}{x!}\, e^{-3.65}$$

In Excel we create a formula for calculating the probabilities

$$p_x^p = \frac{3.65^x}{x!}\, e^{-3.65}, \qquad x \ge 0$$

and then, from these theoretical probabilities, compute the theoretical frequencies $f_x^b = p_x^p \cdot N$:

With the Paste option we can fill in the other cells in the column of theoretical probabilities. The result is:

Now we will calculate the theoretical frequencies:

With the Paste option we can fill in the other cells in the column of theoretical frequencies. The result is:

That was the procedure for approximation with the Poisson distribution. We now have the distribution of this variable and can make predictions. The quality of the approximation is measured by the error of approximation.

The error of approximation for each modality is: $d_k = f_k - f_k^b$

Because the errors have different signs, we square them:

We then sum the squared errors:

$$\sigma_b^2 = \frac{1}{n+1} \sum d_k^2 = \frac{1941.47}{251} = 7.76$$

The approximation error is 7.76.
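The first steps of Example 8 (grouped mean, grouped variance and theoretical Poisson frequencies) can be sketched in Python. This is an illustration under stated assumptions: the variable names are invented here, and the sketch uses the unrounded mean as λ, whereas the text rounds it to 3.65.

```python
from math import exp, factorial

x_values = list(range(11))                             # number of incorrect copies
freqs = [10, 20, 40, 55, 50, 40, 15, 10, 5, 3, 2]      # number of samples
N = sum(freqs)                                         # 250 samples

mean = sum(x * f for x, f in zip(x_values, freqs)) / N
variance = sum(x * x * f for x, f in zip(x_values, freqs)) / N - mean**2

# Mean and variance are close, which is what suggests a Poisson model
lam = mean  # assumption: unrounded mean used as lambda

# Theoretical frequencies: Poisson probability times N
theoretical = [N * exp(-lam) * lam**x / factorial(x) for x in x_values]

print(round(mean, 3), round(variance, 3))
print([round(f, 2) for f in theoretical])
```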



HYPERGEOMETRIC DISTRIBUTION

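The hypergeometric distribution H(N, n, p) is the distribution for n dependent Bernoulli random variables, that is, for sampling without replacement. The symbols are:
- N: number of elements in the population
- M: number of elements in the population with characteristic A
- n: number of elements in the sample
- k: number of elements in the sample with characteristic A, where k ≤ M ≤ N and k ≤ n ≤ N

$p_k^h$ is the probability that a sample from that population contains k elements with characteristic A. Writing $N_1 = M$ and $N_2 = N - M$:

$$p_k^h = p(X = k) = \frac{C_{N_1}^{k}\, C_{N_2}^{n-k}}{C_N^n} = \frac{\binom{N_1}{k} \binom{N_2}{n-k}}{\binom{N}{n}}$$

The expectation and variance are:

$$E(X) = n\,\frac{N_1}{N}; \qquad \sigma^2 = n\,\frac{N_1}{N} \cdot \frac{N_2}{N} \cdot \frac{N - n}{N - 1}$$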

This distribution has application in sampling procedure. When is (n/N


Probability that we select 0 incorrect products:

=HYPGEOMDIST(0;4;9;30) = 0.218391 ≈ 21.84%

Probability that we select 1 incorrect product:

=HYPGEOMDIST(1;4;9;30) = 0.436782 ≈ 43.68%

Probability that we select 2 incorrect products:

=HYPGEOMDIST(2;4;9;30) = 0.275862 ≈ 27.59%

Finally, the probability that we will have not more than 2 incorrect products is the sum of the probabilities found above (the "or" probability for mutually exclusive events): 0.931034 ≈ 93.1%
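The three Excel results above can be reproduced from the hypergeometric formula. In the sketch below the parameters are read off the Excel calls HYPGEOMDIST(k;4;9;30), i.e. a sample of 4 drawn from a population of 30 that contains 9 incorrect products; the function name `hypergeom_pmf` is an illustrative assumption.

```python
from math import comb

def hypergeom_pmf(k: int, n: int, M: int, N: int) -> float:
    """P(k marked elements in a sample of n) for a population of N with M marked elements."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

n, M, N = 4, 9, 30  # sample size, incorrect products in population, population size

probs = [hypergeom_pmf(k, n, M, N) for k in range(3)]  # k = 0, 1, 2
at_most_two = sum(probs)                               # mutually exclusive events: add

print([round(pk, 6) for pk in probs], round(at_most_two, 6))
```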

NORMAL DISTRIBUTION
The normal distribution, also called the Gaussian distribution, is an important family of continuous probability distributions, applicable in many fields. Each member of the family may be defined by two parameters, location and scale: the mean ("average", μ) and the variance (standard deviation squared, σ²), respectively.

The continuous probability density function of the normal distribution is the Gaussian function:

$$f(x_i) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2}, \qquad -\infty < x_i < +\infty$$

where σ > 0 is the standard deviation and the real parameter μ = E(X) is the expected value. To indicate that a real-valued random variable X is normally distributed with mean μ and variance σ² ≥ 0, we write

$$X \sim N(\mu; \sigma^2)$$

Normal probability density function 3

The red line is the standard normal distribution

The standard normal distribution is the normal distribution with a mean of zero and a variance of one (the red curve in the plot). Using the transformation formula, this gives:

$$z_i = \frac{x_i - \mu}{\sigma}, \qquad \varphi(z_i) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z_i^2}{2}}, \qquad Z \sim N(0, 1),$$

$$E(Z) = 0, \quad \sigma_Z^2 = 1, \quad \alpha_3 = 0, \quad \alpha_4 = 3$$

The probability density function has notable properties, including:
- symmetry about its mean
- the mode and median both equal the mean
- the inflection points of the curve occur one standard deviation away from the mean, i.e. at μ − σ and μ + σ.

The cumulative distribution function of a probability distribution, evaluated at a number (lower-case) x, is the probability of the event that a random variable X with that distribution is less than or equal to x. The cumulative distribution function of the normal distribution is expressed in terms of the density function as follows:

$$F(x_i) = p(X \le x_i) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x_i} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} dx$$

The cumulative distribution function of a probability distribution, evaluated at a number (lower-case) z, is the probability of the event that a random variable Z with that distribution is less than or equal to z. The cumulative distribution function of the standardized normal distribution (red line) is expressed in terms of the density function as follows:

$$F(z_i) = p(Z \le z_i) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z_i} e^{-\frac{z^2}{2}} dz$$

There are tables with values of cumulative distribution function of the standardizednormal distribution.

Rules for the standardized normal distribution

The rules for determining probabilities for different kinds of cases with the standardized normal distribution are:
1. $p(Z > z_i) = 1 - F(z_i)$
2. $p(z_i < Z < z_j) = F(z_j) - F(z_i)$, for $i < j$
5. $p(Z < -z_i) = 1 - F(z_i)$
6. $p(-z_i < Z < z_i) = F(z_i) - F(-z_i) = 2F(z_i) - 1$

The next two graphs illustrate how the area under the curve of the standardized normal distribution (the probability) is determined:
1. $p(z < 1.25) = F(1.25)$

2. $p(z < -1.25) = F(-1.25)$ and $p(z < -1.25) = p(z > 1.25) = 1 - p(z < 1.25)$, so $F(-1.25) = 1 - F(1.25)$
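Instead of reading F(z) from a table, the rules above can be tried out numerically. The sketch below builds the standard normal cumulative distribution function from the error function in Python's `math` module; the function name `std_normal_cdf` is an illustrative assumption.

```python
from math import erf, sqrt

def std_normal_cdf(z: float) -> float:
    """F(z) = P(Z <= z) for the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

F = std_normal_cdf
z = 1.25

upper_tail = 1 - F(z)      # p(Z > 1.25) = 1 - F(1.25)
lower_tail = F(-z)         # p(Z < -1.25) = F(-1.25), equal to the upper tail by symmetry
central = F(z) - F(-z)     # p(-1.25 < Z < 1.25) = 2 F(1.25) - 1

print(round(F(z), 4), round(upper_tail, 4), round(lower_tail, 4), round(central, 4))
```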

Characteristic intervals for normal distribution

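If $X \sim N(\mu; \sigma^2)$, then we have characteristic intervals for distances of one, two and three standard deviations from the mean:

$$p(\mu - \sigma < X < \mu + \sigma) = 68.3\%$$
$$p(\mu - 2\sigma < X < \mu + 2\sigma) = 95.4\%$$
$$p(\mu - 3\sigma < X < \mu + 3\sigma) = 99.7\%$$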
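The three characteristic intervals do not depend on the particular values of μ and σ, which can be checked with a short sketch. The values μ = 100 and σ = 15 below are hypothetical and chosen only for illustration, as is the function name `normal_cdf`.

```python
from math import erf, sqrt

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """P(X <= x) for X ~ N(mu; sigma^2), via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

mu, sigma = 100.0, 15.0  # hypothetical parameters, any sigma > 0 gives the same result

coverage = {
    k: normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)
    for k in (1, 2, 3)
}

for k, prob in coverage.items():
    print(f"within {k} standard deviation(s): {prob:.1%}")
```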

Example 5.
The tread life of a certain brand of tire has a normal distribution with a mean of 35000 miles and a standard deviation of 4000 miles. For a randomly selected tire, what is the probability that its life is:

a) less than 37200 miles
b) more than 38000 miles
c) between 30000 and 36000 miles
d) less than 34000 miles
e) more than 33000 miles?

Solution:

$$X \sim N(35000; 4000^2)$$
First we have to standardize, i.e. transform x into z. We use
