Abilità Informatiche Avanzate a.a. 2008-09 lez 5 Prof. Raffaella Folgieri

Abilità Informatiche Avanzate a.a. 2008-09 · 2009. 5. 19. · Prof. Folgieri – aa 2008/09

Objectives of the lesson

In this lesson we consider the variables on scholastic results and past careers jointly, in order to find, if they exist, homogeneous groups of students with common characteristics.

This statistical technique is called segmentation (segmentazione). It is useful not only to describe and explore data, but also for decision-making and planning processes.

The datasets

Double statistical variables

When we consider two characteristics or phenomena on the same population unit, we are usually interested in the relationship between them.

If X and Y are the two variables and n is the number of population units, then, when n is large enough, we can observe how each characteristic interacts with the other.

To highlight this, the best tool is the double-entry table, with one variable on the columns and the other on the rows. At the crossing of two modalities we indicate the absolute frequency.

The frequency distribution of a double statistical variable is given by the triples (x_i, y_j, n_ij).

If the phenomena are both qualitative, or one qualitative and one quantitative, the table is called a contingency table (tabella di contingenza).

If they are both quantitative, it is a correlation table (tabella di correlazione).

The first and the last row of the table show the frequency distribution of the phenomenon Y, while the first and the last column show the frequency distribution of X. These two one-dimensional distributions are called marginal (marginali).
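As a minimal sketch of the double-entry table described above, the following Python snippet counts the joint frequencies n_ij and the two marginal distributions. The observations and the names `pairs` and `double_entry_table` are hypothetical and only serve the illustration.

```python
from collections import Counter

# Hypothetical observations: one (x, y) pair per population unit.
pairs = [
    ("low", "pass"), ("low", "fail"), ("low", "pass"),
    ("high", "pass"), ("high", "pass"), ("high", "fail"),
    ("high", "pass"),
]


def double_entry_table(pairs):
    """Return the joint counts n_ij and the two marginal distributions."""
    joint = Counter(pairs)                      # n_ij for each (x_i, y_j)
    row_totals = Counter(x for x, _ in pairs)   # marginal distribution of X
    col_totals = Counter(y for _, y in pairs)   # marginal distribution of Y
    return joint, row_totals, col_totals


joint, row_totals, col_totals = double_entry_table(pairs)
n = len(pairs)

# Print the table row by row, with the marginal totals at the edges.
for x in sorted(row_totals):
    cells = [joint[(x, y)] for y in sorted(col_totals)]
    print(x, cells, "total:", row_totals[x])
print("column totals:", [col_totals[y] for y in sorted(col_totals)], "n =", n)
```

The marginal totals are obtained by summing the joint counts along a row or a column, which is why they sit at the edges of the table.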

Theoretical Survey

Double statistical variables

The following table contains other one-dimensional distributions, called conditional distributions (distribuzioni condizionate). Each is obtained by fixing one modality of a variable and associating with it the corresponding frequencies of the other variable.

Double statistical variables

To compare the conditional distributions we need the relative row or column frequencies, obtained by dividing each absolute frequency by its row or column total, as shown in this table (for the conditional distribution of X).
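The division by row or column totals can be sketched in a few lines of Python. The table values and the function names `row_profiles` and `column_profiles` are hypothetical; the sketch assumes that no row or column total is zero.

```python
# Hypothetical 2x3 table of absolute frequencies n_ij (rows: X, columns: Y).
table = [
    [10, 20, 10],
    [30, 60, 30],
]


def row_profiles(table):
    """Relative row frequencies: every cell divided by its row total."""
    return [[cell / sum(row) for cell in row] for row in table]


def column_profiles(table):
    """Relative column frequencies: every cell divided by its column total."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[cell / total for cell, total in zip(row, col_totals)]
            for row in table]


rows = row_profiles(table)
cols = column_profiles(table)
print(rows)
print(cols)
```

In this made-up table every row has the same relative frequencies, so the conditional distributions coincide.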

Independence between two phenomena

If the conditional distributions (expressed as relative frequencies) are all the same, they also coincide with the relative marginal distribution, and the two phenomena are called "independent" (indipendenti).

In this case we consider the theoretical frequencies, that is, the frequencies we would observe if the two phenomena were independent.

The differences between the observed and the theoretical frequencies, absolute or relative, are called absolute or relative contingencies, respectively.

The contingencies, arranged in a table, sum to zero along each row and each column. They give information about attraction (attrazione, positive contingencies) and repulsion (repulsione, negative contingencies) between the modalities of the two characters considered.
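The following sketch computes theoretical frequencies and contingencies for a small table. It assumes the usual definition of the theoretical frequency under independence (row total times column total, divided by n); the data and function names are hypothetical.

```python
# Hypothetical 2x2 table of observed absolute frequencies.
table = [
    [20, 10],
    [10, 40],
]


def theoretical_frequencies(table):
    """Frequencies expected under independence: row total * column total / n."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]


def contingencies(table):
    """Observed minus theoretical frequency, cell by cell."""
    expected = theoretical_frequencies(table)
    return [[obs - exp for obs, exp in zip(obs_row, exp_row)]
            for obs_row, exp_row in zip(table, expected)]


expected = theoretical_frequencies(table)
cont = contingencies(table)
print(expected)
print(cont)
```

Positive cells indicate attraction between the two modalities, negative cells repulsion, and each row and column of contingencies sums to zero.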

Pearson connection index

If all the contingencies are zero, the two variables are independent; the larger the contingencies grow, the further we move away from independence.

Among the indices used to measure the level of connection there is the Pearson index.

If we fix a marginal distribution (row or column) in a square table (h = k), we have the maximum connection when each modality of one variable is associated with only one modality of the other, as in the example shown in the following figure:
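The maximum-connection case can also be checked algebraically. The derivation below is a sketch that assumes the usual chi-squared form of the Pearson index and a hypothetical square table whose frequencies all lie on the main diagonal; the symbols n_{i.} and n_{.j} denote the row and column totals.

```latex
% Pearson chi-squared index and an equivalent form
\chi^2 = \sum_{i=1}^{h}\sum_{j=1}^{k}
         \frac{\left(n_{ij}-\hat{n}_{ij}\right)^2}{\hat{n}_{ij}}
       = n\left(\sum_{i=1}^{h}\sum_{j=1}^{k}
         \frac{n_{ij}^2}{n_{i\cdot}\,n_{\cdot j}} - 1\right),
\qquad
\hat{n}_{ij} = \frac{n_{i\cdot}\,n_{\cdot j}}{n}

% Square table (h = k) with n_{ij} = 0 for i \neq j,
% so that n_{ii} = n_{i\cdot} = n_{\cdot i}:
\sum_{i=1}^{h}\frac{n_{ii}^2}{n_{i\cdot}\,n_{\cdot i}} = h
\quad\Longrightarrow\quad
\chi^2 = n\,(h-1)
```

Under these assumptions the index reaches n(h - 1), which is the value commonly used to normalize it to the interval [0, 1].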

Use of Excel to treat double variables

We want to analyse the academic profile of each student. To this aim, we have to create and analyse some variables.

First, we need to identify the variable that indicates the years elapsed from the diploma to the current date for each student (Anni intercorsi). Thanks to the function LEFT (SINISTRA) we can obtain the year in which the student began university studies from the column Anno Immatricolazione.
− In the cell P2, enter the formula =SINISTRA(K2,4)-O2
− Then drag the cell down to all the cells below the first one
− In this way we obtain the values for the column "anno di maturità"
− The quickest way to create the frequency table is to create a pivot table on the cell range A1:S2905, with the variable "Anni intercorsi" as row field and the variable ID as data field
− For the characteristic values, use the Excel functions, as shown in figure
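The steps above can be sketched outside Excel as well. The snippet below mirrors the idea of =SINISTRA(K2,4)-O2 and of the pivot-table count; the record layout (an enrolment label such as "2005/2006" and a diploma year) is an assumption made for illustration, not taken from the dataset.

```python
from collections import Counter

# Hypothetical records: (enrolment year label, diploma year).
students = [
    ("2005/2006", 2005),
    ("2006/2007", 2004),
    ("2005/2006", 2005),
    ("2007/2008", 2001),
]


def years_elapsed(enrolment_label, diploma_year):
    """First four characters of the label as a number, minus the diploma year."""
    return int(enrolment_label[:4]) - diploma_year


elapsed = [years_elapsed(label, year) for label, year in students]

# Equivalent of the pivot table: how many students for each value.
frequency_table = Counter(elapsed)
for value in sorted(frequency_table):
    print(value, frequency_table[value])
```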

Case study 4

Use of Excel to treat double variables
Excel formula for calculating the variance (for Anni Intercorsi)

You can use:
− Var( ) and Var.valori( ), to calculate the variance of a sample. Var.valori( ) also takes logical and text values into account
− Var.pop( ), the function we need here, which returns the variance of the whole population. Its arguments must be numbers or references to numbers. If you want to take logical and text values into account as well, use Var.pop.valori( )

The frequency distribution of the variable "Anni intercorsi" has a long right tail, and the median and the arithmetic mean do not coincide, so neither represents the centre of the distribution well: the distribution is positively skewed.
− We can quickly find the frequency distribution using the function FREQUENZA(Matrice_dati;Matrice_classi), where:
  − Matrice_dati contains the reference to the set of values whose frequencies we want to calculate
  − Matrice_classi is a reference to the range of cells that contains the upper class limits, in our case the range A52:A54

For us the formula will be =FREQUENZA(Anno_corso;A52:A54). Select the output range, type the formula and confirm it with CTRL+SHIFT+ENTER instead of ENTER. In this way the formula is applied to the whole selected range (array formula).
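To make the difference between the two variance functions and the behaviour of the class limits concrete, here is a small Python sketch. The data are hypothetical, and the helper `frequency` is an illustration of the counting rule (each class collects the values above the previous limit and up to its own limit), not a replacement for the Excel function.

```python
import statistics
from bisect import bisect_left

# Hypothetical values of "Anni intercorsi".
years = [0, 0, 1, 1, 1, 2, 3, 5, 9]

sample_var = statistics.variance(years)       # divides by n - 1 (sample)
population_var = statistics.pvariance(years)  # divides by n (whole population)


def frequency(data, upper_limits):
    """Count the values falling in each class (previous limit, limit].

    The result has one extra element that counts the values
    above the last upper limit.
    """
    limits = sorted(upper_limits)
    counts = [0] * (len(limits) + 1)
    for value in data:
        # Index of the first limit that is >= value.
        counts[bisect_left(limits, value)] += 1
    return counts


counts = frequency(years, [1, 3, 5])
print(sample_var, population_var)
print(counts)
```

The sample variance is always larger than the population variance on the same data, because it divides the same sum of squared deviations by n - 1 instead of n.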

Use of Excel to treat double variables
Excel formula for calculating the variance (for Anni Intercorsi)

In columns C and D, rows 52, 53 and 54, insert the formulas for the relative frequencies and for the cumulative frequencies.

The mode corresponds to the modality with the maximum relative frequency (equivalently, the maximum absolute frequency in B52:B54), so we just have to test whether the frequency of the first, second or third modality is the maximum observed. So, in the cell G51, we insert the formula =SE(B52=MAX(B52:B54),A52,SE(B53=MAX(B52:B54),A53,A54))
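The same three quantities can be sketched in Python. The modalities and frequencies below are hypothetical stand-ins for the cells A52:A54 and B52:B54.

```python
from itertools import accumulate

# Hypothetical modalities (column A) and absolute frequencies (column B).
modalities = [1, 3, 5]
absolute = [5, 2, 1]

total = sum(absolute)
relative = [f / total for f in absolute]   # relative frequencies
cumulative = list(accumulate(relative))    # cumulative frequencies

# Mode: the modality with the largest frequency. index() returns the first
# maximum, as the nested SE formula does when there is a tie.
mode = modalities[absolute.index(max(absolute))]

print(relative, cumulative, mode)
```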

Use of Excel to treat double variables
Excel formula for calculating the frequency distribution (for Tipo Iscrizione)

You can use the function CONTA.SE( ), because we have few answer modalities (the function FREQUENZA is not applicable because we have text values):
− In the cell B62 of the sheet Tappa A Distr - Val Car, insert the formula =CONTA.SE(Tipoiscr,A62)
− Copy the same formula into the cells B63 and B64
− To calculate the characteristic values, insert in the ranges C62:C64 and D62:D64 the relative and the cumulative frequencies, respectively, and calculate the mode and the median

Now, in the sheet DataSetA, we have to round the variable Voto_medio, using the function ARROTONDA( ) (in English: ROUND):
− Assign the same name, "Voto medio", to the new variable
− To define the frequency distribution, use the function FREQUENZA( ), applied to the cell range B72:B84

To obtain mode, median, mean and variance, follow the formulas shown in figure:
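Counting text modalities and rounding the averages can be sketched as follows. The enrolment types and the averages are hypothetical. Note that Python's built-in `round` sends ties to the nearest even number, so the sketch uses the `decimal` module to round halves away from zero, which is how a spreadsheet ROUND behaves.

```python
from collections import Counter
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical text values for the enrolment type.
enrolment_types = ["full-time", "part-time", "full-time", "full-time", "repeating"]

# One count per modality, like one CONTA.SE formula per row.
counts = Counter(enrolment_types)


def round_half_up(value):
    """Round to the nearest integer, with ties going away from zero."""
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Hypothetical average marks.
averages = [24.5, 25.5, 27.49, 18.0]
rounded = [round_half_up(v) for v in averages]

print(counts)
print(rounded)
```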

Use of Excel to treat double variables
Evaluation of the opportunity to create modal classes (for Crediti)

For the variable Crediti, we want to evaluate whether it is worth creating modal classes. To do this, we use a pivot table with the number of credits (variable Crediti) as row field and the variable ID as data field. As summary function use conteggio (count), and insert the pivot table in the cell A91 of the sheet "Passo A Distr – Val Var".

The variable has a distribution with many modalities, each with a low frequency. So, in this case, it is better to recode the variable numero crediti formativi acquisiti into modal classes.

Try to perform this recoding as an exercise (remember the previous lessons).
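One possible way to approach the recoding, sketched in Python rather than Excel, is to map each value to a fixed-width class and then count the classes. The credit values and the class width of 30 are assumptions chosen only for illustration; the lesson does not prescribe them.

```python
from collections import Counter

# Hypothetical numbers of credits earned by ten students.
credits = [5, 12, 18, 30, 31, 45, 60, 60, 75, 120]

CLASS_WIDTH = 30  # hypothetical class width


def credit_class(value, width=CLASS_WIDTH):
    """Return the label of the class that contains the value, e.g. '30-59'."""
    lower = (value // width) * width
    return f"{lower}-{lower + width - 1}"


class_counts = Counter(credit_class(c) for c in credits)
for label, count in class_counts.items():
    print(label, count)
```

With fewer, wider classes the frequencies are no longer spread thinly over many modalities.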

Graphical representation of the variables

Select the cell range A8:B45 of the sheet Tappa A – Distr Val Cal. If we build a chart from a pivot table, we obtain an interactive visualization. Through the pivot chart toolbar you can change the visualization, choosing, for example, a pie chart, as in figure.

Graphical representation of the variables

Now do the same for the ranges A61:A65 and C61:C65 (hold down CTRL while selecting the ranges). The best choice is the histogram.

Graphical representation of the variables

To represent the variable Voto_medio, use a non-stacked histogram and choose the cell range A71:B84.

Independent variables

First, create the new variable "Rendimento" as follows (we need a series of nested SE):
− In the cell U2 of the sheet DataSetA, insert the formula

=SE(E(T2>=18,T2<=21),"Sufficiente",SE(T2<=24,"Discreto",SE(T2<=27,"Buono",SE(T2<=30,"Ottimo",""))))

− Drag the formula to the cells below U2

Now create a list to sort the variable in a custom order:
− Select the menu TOOLS/OPTIONS (Strumenti/Opzioni) and choose the tab LISTS (Elenchi). Add a custom list
− Click OK and sort (menu DATA/SORT) the whole dataset by the newly defined variable
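The classification and the custom sort order can be sketched in Python as follows. The thresholds follow the nested SE formula; leaving values outside the range 18 to 30 blank is an assumption of this sketch, and the sample averages are hypothetical.

```python
# Custom order of the labels, playing the role of the Excel custom list.
ORDER = ["Sufficiente", "Discreto", "Buono", "Ottimo"]


def rendimento(average):
    """Map an average mark to a performance label."""
    if average < 18 or average > 30:
        return ""  # assumption: values outside 18-30 are left blank
    if average <= 21:
        return "Sufficiente"
    if average <= 24:
        return "Discreto"
    if average <= 27:
        return "Buono"
    return "Ottimo"


# Hypothetical average marks.
averages = [29.0, 18.0, 25.5, 22.0, 21.0]
labels = [rendimento(a) for a in averages]

# Sort by position in the custom list instead of alphabetically.
sorted_labels = sorted(labels, key=ORDER.index)
print(sorted_labels)
```

An alphabetical sort would put "Buono" first; sorting by position in the custom list keeps the labels in order of increasing performance.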

Case study 5

Independent variables

Now we want to evaluate whether a relationship exists between the variable "Reddito" and the variable "rendimento accademico".

We use a pivot table, based on the whole dataset, with Fascia di reddito as row field, Rendimento as column field, ID (renamed nij) as data field and conteggio as summary function. Put the pivot table in the cell A26 of the sheet Passo C e D.

The table contains no zeros, so the two variables could be independent.

Independent variables

We can check for stochastic independence by comparing the conditional frequencies by row and by column, and through a graphical representation.
− Copy the pivot table obtained into the cells A43 and A62
− In the first table select the data field (from the cell labelled nij) and modify campo pivot table, selecting % di riga and setting the name to Condizionale di riga
− Do the same with the other table for Condizionale di colonna
− Create the corresponding charts, using a non-stacked bar chart for the first series of data and a non-stacked histogram for the second

The values of "Condizionale di riga" and "Condizionale di colonna" are different, so, from the tables and the charts, we can conclude that there is no stochastic independence between the variables "Fascia di Reddito" and "Rendimento".

Pearson indices

Now we want to prepare the data to apply the Pearson index (Chi-quadrato). Copy the pivot table in figure to the cells A106, A126, A143 and A158 of the same sheet.

Case study 6

Pearson indices

Do not change the table in A106.

Pearson indices

In the table in cell A126, obtain the theoretical frequencies by applying the formula:

\[ \hat{n}_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n} \]

Pearson indices

In the table in cell A143, we can obtain the table of contingencies with the formula:

\[ c_{ij} = n_{ij} - \hat{n}_{ij} \]

Pearson indices

In the table in cell A158, we calculate the quantities:

\[ c_{ij}^{2} / \hat{n}_{ij} \]

Chi-squared and normalization (indice chi-quadrato e normalizzazione)

We have to fill in the cells shown in figure with the appropriate Excel functions:

The value of the normalized index (0.004) is very close to zero and reveals a condition very close to stochastic independence between the variables "Fascia di Reddito" and "Rendimento".
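The whole chain (theoretical frequencies, contingencies, squared contingencies over theoretical frequencies, their sum) can be sketched in Python. The normalization used here, dividing by n times min(h - 1, k - 1), is an assumption: it is a common choice, but the slides do not show which normalization the Excel sheet applies. The tables are hypothetical and must not contain rows or columns whose total is zero.

```python
def pearson_chi_squared(table):
    """Sum of c_ij^2 / n_hat_ij over all the cells of the table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # theoretical frequency
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def normalized_index(table):
    """Chi-squared divided by its assumed maximum, n * min(h - 1, k - 1)."""
    n = sum(sum(row) for row in table)
    h, k = len(table), len(table[0])
    return pearson_chi_squared(table) / (n * min(h - 1, k - 1))


independent = [[10, 20], [30, 60]]   # hypothetical: proportional rows
max_connection = [[5, 0], [0, 7]]    # hypothetical: diagonal table

print(normalized_index(independent))
print(normalized_index(max_connection))
```

With this normalization the index is 0 for a table with proportional rows and 1 for a diagonal table, so a value such as 0.004 lies very near the independence end of the scale.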

Excel functions used in lessons 3, 4, 5 (English name, with the Italian name in parentheses)

SUMIF (SOMMA.SE)
COUNTIF (CONTA.SE)
MIN (MIN)
MAX (MAX)
RIGHT (DESTRA)
LEFT (SINISTRA)
VLOOKUP (CERCA.VERT)
INDEX (INDICE)
DCOUNTA (DB.CONTA.VALORI)
MODE (MODA)
MEDIAN (MEDIANA)
AVERAGE (MEDIA)
FREQUENCY (FREQUENZA)
VAR (VAR)
VARA (VAR.VALORI)
VARP (VAR.POP)

Summary of used functions

The IT4PS certification project

Main objectives:
− Advanced skills in the usage of spreadsheets and databases
− Aimed at enhancing problem-solving capabilities instead of package knowledge

Three areas of application:
− Economy
− Medicine and Pharmacy
− Statistics for social sciences

Teaching material and books produced

The IT4PS certification

How to:
− 1-hour test to solve a real problem, in one of the application areas, with advanced usage of a spreadsheet or database
− Result-based vs. solution-based evaluation
− Tested mid 2007 and launched as an Italian certification (AICA)
− Final goal: to export IT4PS at the European level through CEPIS (just done)