Abilità Informatiche Avanzate a.a. 2008-09 lez 3 - 4 Prof. Raffaella Folgieri

Abilità Informatiche Avanzate a.a. 2008-09 · 2009-05-19 · Prof. Folgieri – aa 2008/09 Objectives of the analysis In the first part of our case studies, we’ll analyse students’



Abilità Informatiche Avanzate a.a. 2008-09

lez 3 - 4

Prof. Raffaella Folgieri


Prof. Folgieri – aa 2008/09

Data collection

− We'll analyse data collected by a University for different purposes: management, accounting and organisation.
− The data were collected by different entities, so they are not homogeneous.
− We have both qualitative and quantitative data, so we'll have to treat them differently.
− We need a Database Management System to store and manage the data, but we'll use a statistical tool (Excel) to analyse them.
− Variable: a qualitative or quantitative characteristic recorded for each statistical unit (a column of the data matrix; a “field” in IT).
− Statistical unit: a row of the data matrix (a “record” in IT).


Objectives of the analysis

In the first part of our case studies, we'll analyse the students' data from a socio-economic point of view.

Afterwards, we'll consider the variables on scholastic results and past careers together, in order to find out whether there are homogeneous groups of students with common characteristics. This statistical technique is called segmentation (segmentazione); it is useful not only to describe and explore data, but also to support decision-making and planning processes.

The datasets


Dataset description and Excel model

We have 3 datasets on our CD, but we are interested only in Dataset A:
− In this dataset, the students' office collects, grouped by Faculty, personal (anagraphical) data, information about previous schools and results (expressed in CFU, the university credits, and scores). The data relate to the academic year 2002/03.

First we have to import the data into a worksheet.
1. Create a model
2. Open a new Excel file
3. Set the format of the sheet in the following way:
4. Name the sheet “DatasetA”
5. Set the colour of the first column to yellow and that of the first row to orange
6. Freeze the panes below the first row and to the right of the first column (menu Window/Freeze Panes, Finestra/Blocca riquadri)
7. Set the font size to 8 pt
8. Save the model as “MODELLO DATA SET A”


Variables' names

Now import the data from the dataset:
− Open the model MODELLO DATA SET.xlt
− Save it with a new name
− If you want, you can protect the sheet with a password (Tools/Options, Strumenti/Opzioni)
− Select the cell A1 and import the file DatasetA.txt (Data/Import External Data/Import Data, Dati/Importa dati esterni/Importa dati), following the steps shown by the wizard (the data are tab-delimited)
− Now sort the data by the students' ID in ascending order
− Assign names to the variables through the menu Insert/Name/Define (Inserisci/Nome/Definisci)
− Also assign a name to the complete dataset
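For readers who want to check the result outside Excel, the import and sort steps can be sketched in Python. This is only an illustration: the column names and the three sample rows are invented, since the real layout of DatasetA.txt is not shown in these slides.

```python
import csv
import io

# Hypothetical tab-delimited sample standing in for DatasetA.txt
# (column names and values are assumptions, not the real file).
SAMPLE = "ID\tFacolta\n3\tB\n1\tA\n2\tA\n"

def load_sorted(text):
    """Read tab-delimited text and sort the records by numeric ID, ascending."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    rows.sort(key=lambda row: int(row["ID"]))
    return rows

rows = load_sorted(SAMPLE)
print([row["ID"] for row in rows])
```

Converting the ID to an integer before sorting avoids the text ordering in which "10" would come before "2".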


Data format and redundancies

Verify the data format. For example, the date of birth is in the format dd/mm/yy (gg/mm/aa). To obtain the format dd-mm-yy (gg-mm-aa), select all the values of the variable (recall it from the menu with the variables' names), then select Format/Cells/Date (Formato/Celle/Data) and choose the format. We can perform the same operation on the field “importo del reddito” (income amount), to which we wish to add the euro symbol (€).

Now we have to delete the redundancies. For example, in the variable “tipo maturità” (type of high-school diploma) we can delete the term MATURITA’. To do this, insert a new column after the variable's column. Select the values of the variable (through its name), select Data/Text to Columns (Dati/Testo in colonne), choose Fixed width (Larghezza fissa) and then click Finish. Now hide the column “Tipo Maturità” and assign the name again to the new variable.
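The two clean-up operations can also be sketched in Python. Note that this sketch removes the redundant term by prefix matching rather than by a fixed-width split as in Excel, and that the sample values and the spelling of the prefix are assumptions made for illustration.

```python
from datetime import datetime

def reformat_date(text):
    """Convert a dd/mm/yy string to dd-mm-yy."""
    return datetime.strptime(text, "%d/%m/%y").strftime("%d-%m-%y")

def drop_redundant_prefix(value, prefix="MATURITA'"):
    """Remove a leading redundant term and the surrounding spaces, if present."""
    text = value.strip()
    if text.startswith(prefix):
        text = text[len(prefix):].strip()
    return text

# Hypothetical values, not taken from the real dataset
print(reformat_date("05/03/83"))
print(drop_redundant_prefix("MATURITA' SCIENTIFICA"))
```

Values that do not start with the prefix are returned unchanged, so the function can be applied to the whole column.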


Position indices: theoretical survey

Position indices (indici di posizione): values used to summarise a frequency distribution, to highlight its fundamental characteristics and to compare it with other distributions of the same phenomenon, for example at different times or in different places.

There are many position indices, but they all share a common property, internality (internalità):

− The index always lies in the interval between the minimum value $x_1$ and the maximum value $x_k$ of the distribution.


Position indices: Mode

Its symbol is $M_o$ and it is the modality corresponding to the maximum frequency. A distribution can have one, two or more modes.

If several modalities share the same maximum frequency, the mode loses its descriptive capability.

We can define the mode for every kind of phenomenon.

For a distribution in frequency classes (that is, a continuous quantitative phenomenon), we identify the modal class using the specific frequencies. The mode is the central value of the modal class $i$, where $h_i$ and $h_{i+1}$ are its lower and upper limits:

$$M_o = \frac{h_i + h_{i+1}}{2}$$
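As a minimal sketch of the modal class rule, the following Python example picks the class with the highest specific frequency (absolute frequency divided by class amplitude) and returns its central value. The class limits and frequencies are invented for illustration.

```python
# Hypothetical classes: (lower limit, upper limit, absolute frequency)
classes = [(18, 20, 40), (20, 22, 90), (22, 24, 60), (24, 27, 45), (27, 30, 15)]

def modal_class(classes):
    """Return the class with the highest specific frequency n_i / a_i."""
    return max(classes, key=lambda c: c[2] / (c[1] - c[0]))

def mode(classes):
    """Central value of the modal class: (lower + upper) / 2."""
    lower, upper, _ = modal_class(classes)
    return (lower + upper) / 2

print(modal_class(classes))  # (20, 22, 90)
print(mode(classes))         # 21.0
```

Using the specific frequency instead of the absolute one matters when the classes have different amplitudes, as the 24 to 27 and 27 to 30 classes do here.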

Theoretical Survey


Position indices: Median

Median (Mediana): once the $n$ observations are arranged in non-decreasing order, the median ($M_e$) is the modality that occupies the central position, dividing the distribution into two equal parts. The data are ordered by calculating the cumulative frequencies $N_i$.

If $n$ is odd, we have just one median position:

$$P = \frac{n+1}{2}$$

If $n$ is even, we have two median positions:

$$P_1 = \frac{n}{2}, \qquad P_2 = \frac{n}{2} + 1$$

We distinguish:
− Rectilinear (ordered) qualitative phenomena: if $P_1$ and $P_2$ correspond to the same modality, the median is that modality; if they correspond to two different modalities, the median is indeterminate.
− Discrete quantitative phenomena: if $P_1$ and $P_2$ correspond to the same modality $x_i$, then $M_e = x_i$; if $P_1$ corresponds to $x_i$ and $P_2$ to $x_{i+1}$, then:

$$M_e = \frac{x_i + x_{i+1}}{2}$$

− Continuous quantitative phenomena: we consider only one median position:

$$P = \frac{n+1}{2}$$

So, once we have used $P$ to identify the median class (for example the class $i$), we have to find the value inside the class that divides the distribution into two equal parts. In this case we use the following formula, where $h_i$ is the lower limit, $a_i$ the amplitude and $n_i$ the absolute frequency of the median class, and $N_{i-1}$ is the cumulative frequency of the previous class:

$$M_e = h_i + \frac{a_i}{n_i}\,(P - N_{i-1})$$
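The median rules can be sketched in Python as follows. The function names and the sample data are assumptions made for illustration; the grouped version interpolates inside the median class, starting from its lower limit and using the cumulative frequency of the previous class.

```python
def median_positions(n):
    """One position if n is odd, two positions if n is even (1-based)."""
    if n % 2 == 1:
        return [(n + 1) // 2]
    return [n // 2, n // 2 + 1]

def discrete_median(values):
    """Median of a discrete quantitative variable."""
    ordered = sorted(values)
    picks = [ordered[p - 1] for p in median_positions(len(ordered))]
    return sum(picks) / len(picks)

def grouped_median(classes):
    """Median of a class distribution.

    classes: list of (lower limit, upper limit, absolute frequency),
    in ascending order. Uses the single position P = (n + 1) / 2.
    """
    n = sum(freq for _, _, freq in classes)
    position = (n + 1) / 2
    cumulative = 0  # cumulative frequency of the previous classes
    for lower, upper, freq in classes:
        if cumulative + freq >= position:
            return lower + (upper - lower) / freq * (position - cumulative)
        cumulative += freq
    raise ValueError("empty distribution")

print(discrete_median([3, 1, 2, 4]))                           # 2.5
print(grouped_median([(0, 10, 5), (10, 20, 10), (20, 30, 4)]))  # 15.0
```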


Position indices: Quartiles

The quartiles divide the distribution into four equal parts. In a qualitative phenomenon, the first quartile, $Q_1$, is found at the position $P_1 = (n+1)/4$; the second, $Q_2 = M_e$, is at the same position as the median, $P_2 = (n+1)/2$; the third, $Q_3$, is at the position $P_3 = 3(n+1)/4$. The rules are the same as those seen for the median.

The graphical representation of the quartiles is the boxplot (grafico a scatola).

The distribution is symmetric if, in the boxplot, the distance between the first and the second quartile equals the distance between the second and the third. If the distance between $Q_1$ and $Q_2$ is smaller than the distance between $Q_2$ and $Q_3$, the distribution is positively asymmetric. Conversely, if the distance between $Q_1$ and $Q_2$ is larger than the distance between $Q_2$ and $Q_3$, the distribution is negatively asymmetric.
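A small Python sketch of the quartile positions and of the symmetry check on the quartile distances; the function names and the sample values are illustrative assumptions.

```python
def quartile_positions(n):
    """Positions of Q1, Q2 and Q3 among n ordered observations."""
    return (n + 1) / 4, (n + 1) / 2, 3 * (n + 1) / 4

def asymmetry(q1, q2, q3):
    """Compare the distances Q2 - Q1 and Q3 - Q2, as read from a boxplot."""
    left, right = q2 - q1, q3 - q2
    if left == right:
        return "symmetric"
    return "positive" if left < right else "negative"

print(quartile_positions(11))  # (3.0, 6.0, 9.0)
print(asymmetry(20, 22, 27))   # positive
```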


Position indices: Percentiles

Percentiles split a set of ordered data into hundredths.

The calculation of the percentiles follows the same rules as that of the quartiles.


Position indices: arithmetic mean

The arithmetic mean (or simply the mean) of a list of numbers is the sum of all the items in the list divided by the number of items. If the list is a statistical population, the mean is called the population mean; if the list is a statistical sample, the resulting statistic is called the sample mean.

The mean is the most commonly used type of average and is often referred to simply as the average. The term "mean" or "arithmetic mean" is preferred in mathematics and statistics, to distinguish it from other averages such as the median and the mode.

The weighted mean is similar to the arithmetic mean, except that instead of each data point contributing equally to the final average, some data points contribute more than others. If all the weights are equal, the weighted mean is the same as the arithmetic mean.

Please refer to the course textbook to review the properties and theorems on the arithmetic mean.
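To make the difference between the two means concrete, here is a short Python sketch. The exam scores and credit weights are invented for illustration.

```python
from statistics import mean

def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

scores = [24, 27, 30]   # hypothetical exam scores
credits = [6, 9, 12]    # hypothetical weights (credits of each exam)

print(mean(scores))                      # 27
print(weighted_mean(scores, credits))    # higher, since the best score weighs most
print(weighted_mean(scores, [1, 1, 1]))  # 27.0, equal weights give the plain mean
```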


Variability indices

In statistics, statistical dispersion (also called statistical variability or variation) is the variability or spread in a variable or a probability distribution. Common measures of statistical dispersion are the variance, the standard deviation and the interquartile range.

Dispersion is contrasted with location or central tendency, and together they are the most used properties of distributions.

A measure of statistical dispersion is a real number that is zero if all the data are identical and increases as the data become more diverse. It cannot be less than zero.

Most measures of dispersion have the same scale as the quantity being measured. In other words, if the measurements have units, such as metres or seconds, the measure of dispersion has the same units. Such measures of dispersion include:
− Standard deviation
− Interquartile range
− Range

The variance (the square of the standard deviation) is also location-invariant, but it is not linear in scale: its units are the square of the units of the data.


Variability indices: range

In descriptive statistics, the range is the length of the smallest interval which contains all the data. It is calculated by subtracting the smallest observation from the greatest and provides an indication of statistical dispersion.

It is measured in the same units as the data. Since it depends on only two of the observations, it is a poor and weak measure of dispersion except when the sample size is large.

For a population, the range is at least twice the standard deviation.

The range, in the sense of the difference between the highest and lowest scores, is also called the crude range. When a new scale of measurement is developed, a potential maximum and minimum follow from this scale; their difference is called the potential (crude) range. Of course this range should not be chosen too small, in order to avoid a ceiling effect. Once the measurements are obtained, the smallest and greatest observations provide the observed (crude) range.

The midrange point, i.e. the point halfway between the two extremes, is an indicator of the central tendency of the data. Again, it is not particularly robust for small samples.
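A brief Python sketch of the crude range and the midrange, with a check that the range is not smaller than twice the population standard deviation. The scores are invented for illustration.

```python
from statistics import pstdev

scores = [19, 20, 20, 21, 23, 24, 38]  # hypothetical observations

crude_range = max(scores) - min(scores)     # greatest minus smallest value
midrange = (max(scores) + min(scores)) / 2  # point halfway between the extremes

print(crude_range)                        # 19
print(midrange)                           # 28.5
print(crude_range >= 2 * pstdev(scores))  # True
```

The single high value 38 drives both results, which shows why the range and the midrange are sensitive to extreme observations.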


Variability indices: standard deviation

In probability and statistics, the standard deviation is a measure of the dispersion of a collection of numbers. It can apply to a probability distribution, a random variable, a population or a data set.

The standard deviation is usually denoted with the letter σ (lowercase sigma). It is defined as the root-mean-square (RMS) deviation of the values from their mean, or as the square root of the variance.

Formulated by Galton in the late 1860s, the standard deviation remains the most common measure of statistical dispersion, measuring how widely spread the values in a data set are. If many data points are close to the mean, then the standard deviation is small; if many data points are far from the mean, then the standard deviation is large. If all data values are equal, then the standard deviation is zero.

A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the data.


Variability indices: interquartile range

In descriptive statistics, the interquartile range (IQR), also called the midspread or middle fifty, is a measure of statistical dispersion, equal to the difference between the third and first quartiles.

Unlike the (total) range, the interquartile range is a robust statistic, having a breakdown point of 25%, and is thus often preferred to the total range.

The IQR is used to build box plots, simple graphical representations of a probability distribution.

For a symmetric distribution (so the median equals the midhinge, the average of the first and third quartiles), half the IQR equals the median absolute deviation (MAD).

The median is the corresponding measure of central tendency.
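As an illustration, the IQR can be computed with Python's standard library. The sample is invented; `method="exclusive"` is chosen here because it places the quartiles at the positions (n+1)/4, (n+1)/2 and 3(n+1)/4, interpolating when a position is not a whole number.

```python
from statistics import quantiles

ages = [18, 19, 19, 20, 21, 22, 23, 24, 26, 29, 41]  # hypothetical sample, n = 11

q1, q2, q3 = quantiles(ages, n=4, method="exclusive")
iqr = q3 - q1
total_range = max(ages) - min(ages)

print(q1, q2, q3)        # 19.0 22.0 26.0
print(iqr, total_range)  # 7.0 23
```

The extreme value 41 inflates the total range but leaves the IQR unchanged, which is the robustness property described above.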


Variability indices: variance

In probability theory and statistics, the variance of a random variable, probability distribution or sample is a measure of statistical dispersion, averaging the squared distance of its possible values from the expected value (mean).

The unit of the variance is the square of the unit of the original variable. The positive square root of the variance, called the standard deviation, has the same units as the original variable and can be easier to interpret for this reason.

The variance of a real-valued random variable is its second central moment, and it also happens to be its second cumulant. Just as some distributions do not have a mean, some do not have a variance. The mean exists whenever the variance exists, but not vice versa.
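The definitions of variance and standard deviation can be written out directly. This is a minimal sketch for a whole population (division by n), with invented data.

```python
from math import sqrt

def population_variance(values):
    """Average of the squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

def population_std(values):
    """Positive square root of the variance, in the units of the data."""
    return sqrt(population_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

print(population_variance(data))       # 4.0
print(population_std(data))            # 2.0
print(population_variance([5, 5, 5]))  # 0.0, identical data have no dispersion
```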


Case study 3: students' personal (anagraphical) characteristics

Dataset A does not contain the field “age”, which we need to construct the frequency distribution and to calculate the summary indices. However, it does contain the variable “date of birth” (data di nascita). To perform the following operations, use the worksheet Obiettivo2vuoto.xls on the course textbook's CD.

1. Insert a column (menu Insert/Columns) after the column C.
2. In the new column, use the function YEAR(C2) [ANNO(C2)]. Then format the cell as a number with 0 decimal places and drag the formula down to the cells below the first one.
3. Now insert a new column after the previous one and label it “Età” (Age). Considering that we are analysing data related to 2002, insert the formula = 2002 - D2 in the first cell and drag it down to the following cells of the column.
4. Assign a name to the newly inserted variables.
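The same age calculation can be sketched in Python. Like the Excel formula, it uses only the year of birth; the sample dates are invented.

```python
from datetime import date

REFERENCE_YEAR = 2002  # the year used in the slides for the age calculation

def age_from_birth_date(birth_date, reference_year=REFERENCE_YEAR):
    """Age as reference year minus year of birth (month and day are ignored)."""
    return reference_year - birth_date.year

print(age_from_birth_date(date(1981, 5, 14)))   # 21
print(age_from_birth_date(date(1983, 12, 31)))  # 19
```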

Case study 3


The distribution of the variable Age (“età”): the pivot table

1. Select the cell A7 in the sheet Età and insert a pivot table using the command Data/PivotTable and PivotChart Report (Dati/Rapporto tabella pivot e grafico pivot).
2. At step 2, remember to select the whole dataset (datasetA).
3. At step 3, choose to put the pivot table in the cell A7 of the sheet Età.


4. Choose “Layout” and put “età” in the row area and “ID” in the column area (Excel automatically sets the summary function to “Sum”).
5. Click OK, then finish the wizard.

If you want to change something, you can open the PivotTable Field window and use the PivotTable toolbar.


Evaluating the use of a class distribution for the variable Age

In the table we can see that the distribution of the variable Age (età) has a long tail of high values. For this reason it is better to group its values into classes.

In the cell F14, calculate the number of students whose age is <= 30, using the function SUMIF (SOMMA.SE).

This is more than 90% of the total, so we can choose to set 6 classes, as follows.
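The conditional sum can be sketched in Python on a frequency table of ages. The counts below are invented for illustration and are not the values of Dataset A.

```python
# Hypothetical pivot-table output: age -> number of students (invented counts)
age_counts = {19: 300, 20: 280, 21: 200, 24: 90, 30: 40, 35: 50, 52: 40}

def share_up_to(counts, limit):
    """Fraction of students whose age is less than or equal to the limit."""
    selected = sum(n for age, n in counts.items() if age <= limit)
    return selected / sum(counts.values())

print(share_up_to(age_counts, 30))  # 0.91
```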


How to create the variable Age Classes

We have to insert a new column after “età”. The function to use is IF (SE); we have to nest several SE functions to obtain the result:

=SE(E2<=20,"minore o uguale a 20",SE(E2<=22,"20-|22",SE(E2<=24,"22-|24",SE(E2<=27,"24-|27",SE(E2<=30,"27-|30","oltre 30")))))

The first and last labels mean “less than or equal to 20” and “over 30”. Then drag the formula down to the other cells of the column. Depending on the regional settings, Excel may require “;” instead of “,” as the argument separator.
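The nested SE formula is equivalent to a chain of conditions checked in order. A Python sketch that returns the same labels as the formula:

```python
def age_class(age):
    """Return the age-class label used by the nested SE formula."""
    if age <= 20:
        return "minore o uguale a 20"
    if age <= 22:
        return "20-|22"
    if age <= 24:
        return "22-|24"
    if age <= 27:
        return "24-|27"
    if age <= 30:
        return "27-|30"
    return "oltre 30"

print([age_class(a) for a in (20, 22, 23, 27, 31)])
```

Because the conditions are tested from the lowest threshold upwards, each upper limit belongs to its own class: 22 falls in "20-|22" and not in "22-|24".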


Frequency distribution of the variable Age Classes

Insert the function COUNTIF (CONTA.SE) in the cell B62.

In the cell B68, put the sum of the absolute frequencies, using the function SUM.
Under fi, put the formula to calculate the relative frequencies: =B62/$B$68
In the column Lim.Inf., put the formula to calculate the lower limit: =MIN(Età)

In the following column, put the values of the upper limit, using the function RIGHT (DESTRA) on the string that identifies the class.
Then add, in this order and using the known formulas:
− amplitude: =E62-D62
− specific frequency FSpec: =B62/F62
− central value of the class xi: =(F62)/2+D62
− cumulative frequencies Ni: =B62 for the first class; each following class adds its own frequency to the previous cumulative value
− term for the arithmetic mean, xi*ni: =H62*B62
− term for the variance, xi²*ni: =(H62^2)*B62
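The columns of this worksheet can be reproduced with a short Python function, which may help to check the Excel results. The class limits and frequencies in the example are invented, and the dictionary keys are hypothetical names chosen for this sketch.

```python
def frequency_table(classes):
    """Build the worksheet columns for a class distribution.

    classes: list of (lower limit, upper limit, absolute frequency),
    in ascending order.
    """
    total = sum(n for _, _, n in classes)
    rows, cumulative = [], 0
    for lower, upper, n in classes:
        amplitude = upper - lower
        centre = lower + amplitude / 2
        cumulative += n  # running total of the absolute frequencies
        rows.append({
            "n": n,                      # absolute frequency
            "f": n / total,              # relative frequency
            "amplitude": amplitude,
            "specific": n / amplitude,   # specific frequency
            "centre": centre,            # central value of the class
            "N": cumulative,             # cumulative frequency
            "xn": centre * n,            # term for the mean
            "x2n": centre ** 2 * n,      # term for the variance
        })
    return rows

table = frequency_table([(18, 20, 10), (20, 22, 30), (22, 24, 10)])
for row in table:
    print(row)
```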


Position indices

We have a distribution by frequency classes, so we must find the mode as the modal class (the one with the maximum specific frequency) and take the modal value as the central value of that class. So in the cell B75 we'll use the function VLOOKUP (CERCA.VERT):
=CERCA.VERT(MAX(G62:G67),G62:H67,2,FALSO)

If we have a vertically organised table, CERCA.VERT takes values from that table according to a parameter written in a cell.

The median position is obtained with =B68/2

The median class is given by =INDICE(A62:A67,(CONFRONTA(B78,I62:I67,1))+1,1)
The value of the median is found with the formula =D63+((F63*(B78-I62))/B63)
=J68/B68 gives the mean
=(K68/B68)-(B82^2) gives the variance
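The mean and variance formulas for a class distribution (the sum of xi*ni divided by n, and the sum of xi²*ni divided by n minus the squared mean) can be checked with a few lines of Python. The central values and frequencies are invented for illustration.

```python
def grouped_mean_and_variance(centres, counts):
    """Mean and variance from class central values and absolute frequencies."""
    n = sum(counts)
    mean = sum(x * k for x, k in zip(centres, counts)) / n
    second_moment = sum(x ** 2 * k for x, k in zip(centres, counts)) / n
    return mean, second_moment - mean ** 2

# Hypothetical classes with central values 19, 21, 23
mean, variance = grouped_mean_and_variance([19, 21, 23], [10, 30, 10])
print(mean)      # 21.0
print(variance)  # approximately 1.6
```

The variance is computed as the mean of the squares minus the square of the mean, which is why the worksheet needs the column xi²*ni in addition to xi*ni.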


Bivariate table

Now we can fill the bivariate table in A113:H123 (sheet “ETA’”). We have to insert the formulas:

=DB.CONTA.VALORI(DataSetA;$B$112;A92:B93) in the cell B114

=DB.CONTA.VALORI(DataSetA;$B$112;A94:B95) in the cell B115

=DB.CONTA.VALORI(DataSetA;$B$112;C92:D93) in the cell C114

DB.CONTA.VALORI is the Italian name of DCOUNTA; its third argument is the range that holds the criteria. We could also perform this operation with a macro, but that is beyond the scope of this course.

In the next slides we'll see how to find and analyse the characteristic values of the variables Number of Family Members (numero di componenti del nucleo familiare) and Income Range (Fascia di Reddito) for the University and for each Faculty.
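A bivariate table counts how many statistical units share each pair of modalities. As a sketch of the idea in Python, using a counter instead of database functions and with invented records:

```python
from collections import Counter

# Hypothetical (age class, faculty) pairs, one per student
students = [
    ("20-|22", "Faculty 1"),
    ("20-|22", "Faculty 1"),
    ("22-|24", "Faculty 1"),
    ("20-|22", "Faculty 2"),
]

def bivariate_table(pairs):
    """Joint absolute frequencies n_ij for each pair of modalities."""
    return Counter(pairs)

table = bivariate_table(students)
print(table[("20-|22", "Faculty 1")])  # 2
print(table[("27-|30", "Faculty 2")])  # 0, no student with this combination
```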


Frequency distributions for the variable “number of family members” by Faculty and University, and how to calculate the position indices

Also in this case we'll use the function DB.CONTA.VALORI, after filling the area “Comp Nucleo Fam” on the sheet with the right criteria (10 values for the number of family members * 9 values for the Faculties = 90 criteria).


Frequency distributions for the variable “number of family members”: position indices

It is not necessary to construct a class distribution for this variable, so we don't have to calculate the class limits, the amplitude, the specific frequency or the central value. We just need to calculate:
− fi(100), the relative frequencies as percentages
− Ni, the cumulative frequencies
− xi*ni, the terms of the weighted sum of the modalities
− xi²*ni, the terms of the sum of the squared modalities, weighted by the absolute frequencies

We have to repeat these calculations for the University and for each Faculty, so we can also write a macro in Excel.

For the mode, use the functions CONFRONTA( ) (MATCH) and INDICE( ) (INDEX), as for the median. For example, for the mode:
=INDICE(A49:A58;(CONFRONTA(MAX(C49:C58);C49:C58;0));1)
For the median of Faculty 1, we'll have:
=INDICE(A49:A58;(CONFRONTA(B124;D49:D58;1)+1);1)

See the table from row 122 on the sheet “comp nucleo fam”.


Frequency distributions for the variable “Income Range” (Fascia di Reddito) by University and single Faculty

In this case we'll choose a different method. Select the cell A9 of the sheet “Fascia di Reddito” and create a pivot table:
− Based on the range A1:V2905 (the whole dataset)
− With “Fascia di Reddito” as row field
− With the variable “ID” as data field

The summary function is CONTEGGIO (Count), and the default name of the data field will be replaced by nij.
A pivot table is the quickest way to create a bivariate table (instead of using DB.CONTA.VALORI).
Now we can copy the Faculties' distributions into the related tables (in this case without transposing the data).
Also in this case, we can create an Excel macro to speed up the operation.


Frequency distributions for the variable “Income Range” (Fascia di Reddito): position indices

To calculate the position indices we can follow the same instructions given for the number of family members.
See the table from row 103 on the sheet “fascia di reddito”.
