Statistics

INTRODUCTION TO MULTIVARIATE STATISTICS

Dr. Debdulal Dutta Roy, Ph.D. (Psy.)
Psychology Research Unit
Indian Statistical Institute
Kolkata – 700 108
E-mail: [email protected] (o)
[email protected]
http://www.isical.ac.in/~ddroy/abstract.html

The Institute

• Indian Statistical Institute (ISI) is a unique institution devoted to the research, teaching and application of statistics, natural sciences and social sciences. Founded by Professor P.C. Mahalanobis in Kolkata on 17th December, 1931, the institute gained the status of an Institution of National Importance by an act of the Indian Parliament in 1959.

• Research in Statistics and related disciplines is the primary activity of the Institute. Teaching activities are undertaken mainly in Kolkata, Delhi and Bangalore.

MYTHS

• Statistical treatment of more than 2 variables is multivariate statistics.
  – No. We can use multivariate statistics when more than 2 variables are interrelated with each other.

Myth 2

• The purpose of multivariate statistics is to establish correlation among sets of variables.
  – True, but its purpose is not limited to determining relations among sets of variables. It also helps to control the effect of intervening variables on the relationship among sets of variables.

Myth 3

• Multivariate analysis loses the original scores.
  – Accepted, if the analysis extracts more latent properties within the variables.

What is MVS?

• MVS refers to the set of statistical tools used to find the pattern of relationships among a set of variables: independent, dependent and intervening variables.

• The definition suggests that MVS cannot be used when the variables are not correlated with each other.

• Therefore, before applying MVS, it is necessary to compute the correlations among the variables.
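The correlation check described in the last bullet can be sketched as follows. This is a minimal illustration, assuming Python with NumPy; the function name `correlation_screen`, the 0.3 cut-off and the scores are hypothetical and do not come from the slides.

```python
import numpy as np

def correlation_screen(data, threshold=0.3):
    """Return the correlation matrix and the variable pairs with |r| >= threshold.

    The threshold is an illustrative assumption, not a rule from the slides.
    """
    X = np.asarray(data, dtype=float)
    R = np.corrcoef(X, rowvar=False)  # columns are treated as variables
    pairs = []
    n_vars = R.shape[0]
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if abs(R[i, j]) >= threshold:
                pairs.append((i, j, float(R[i, j])))
    return R, pairs

# Hypothetical scores: 6 respondents (rows) on 3 variables (columns).
scores = [
    [2, 4, 9],
    [3, 6, 1],
    [5, 10, 5],
    [6, 12, 8],
    [8, 16, 2],
    [9, 18, 6],
]
R, pairs = correlation_screen(scores)
print(np.round(R, 2))
print(pairs)  # variable pairs that are correlated strongly enough to consider
```

If no pair passes the screen, the variables show little shared structure and a multivariate tool is unlikely to add much.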

List of Multivariate Statistical Tools

Can we assess their perception, beliefs and attitudes?

• Determining differences among groups:
  – ANOVA with factorial design
  – MANOVA
  – Discriminant function analysis

• Determining structure of relationship:
  – Multiple regression
  – Canonical correlation
  – Principal component analysis
  – Correspondence analysis
  – Cluster analysis

List of Multivariate Research Questions on Women and Child Development

Multivariate Research Questions for Women & Child Development (Difference Perspective)

• Multiple regression
  – What is the predictive strength of poverty, inequality, war, criminal networks, ruthless demand for cheap labour and commercial sexual exploitation in predicting motivation to human trafficking?

• Factorial design
  – Does the eating habit (DV) of women vary with age, education and socio-economic status (IVs)?

• MANOVA
  – Does the food belief of pregnant mothers vary with religion?

• Discriminant function analysis
  – What is the predictive capacity of a food attitude questionnaire to classify students in terms of their mid-day meal taking?

Multivariate Research Questions for Women & Child Development (Relation Perspective)

• Canonical correlation
  – Is there any relation between awareness of nutrition and motivation to follow good food taking habits?

• Principal component analysis
  – What is the meaning of good food taking motivation?

• Correspondence analysis
  – Can we map different districts of one state in terms of human trafficking?

• Cluster analysis
  – Is it possible to classify states in terms of immunization?

Some studies on Application of Multivariate Statistics

Principal Component Analysis

• Principal component analysis is a technique (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, that is to classify variables. Therefore, PCA is applied as a data reduction or structure detection method. In principal component analysis, we seek linear composites of the original variables that display certain desirable properties, namely, scores that exhibit maximal variance, subject to being uncorrelated with previously computed composites.
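The idea of linear composites with maximal variance that are uncorrelated with one another can be sketched numerically. The example below is a minimal illustration, assuming Python with NumPy and a PCA computed from the correlation matrix; the function name `pca_from_correlation` and the simulated data are hypothetical and are not part of the study reported in these slides.

```python
import numpy as np

def pca_from_correlation(data, n_components=2):
    """PCA by eigen-decomposition of the correlation matrix.

    Returns all eigenvalues (descending), the component weights,
    and the component scores for the retained components.
    """
    X = np.asarray(data, dtype=float)
    # Standardize so that every variable contributes one unit of variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    weights = eigvecs[:, :n_components]
    scores = Z @ weights                        # linear composites of the variables
    return eigvals, weights, scores

# Simulated data (assumption): two variables share a latent source, two are noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
data = np.hstack([latent + 0.3 * rng.normal(size=(200, 2)),
                  rng.normal(size=(200, 2))])

eigvals, weights, scores = pca_from_correlation(data)
print(np.round(eigvals, 2))                         # eigenvalues sum to the number of variables
print(np.round(np.corrcoef(scores, rowvar=False), 2))  # composites are uncorrelated
```

The variance of each composite equals its eigenvalue, which is why the largest eigenvalue marks the composite with maximal variance.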

Study on Principal Component Analysis

• Purpose: To determine the principal components of school infrastructure perception.

• Assumption: School infrastructure perception encompasses a set of 13 variables: students' perception of school infrastructures such as Classroom, Drinking Water, Toilet, Blackboard, Teachers, Book, Teaching learning materials, Friends, Games, Cultural Programs, Book bank, Mid-day Meal, and Health Check-up. Most of these infrastructures are available in rural schools. It is assumed that there is some latent structure in the perception of these 13 variables.

                        1       2       3       4       5       6       7       8       9       10      11      12      13
 1 Classroom            1
 2 Drinking Water       0.25**  1
 3 Toilet               0.28**  0.61**  1
 4 Black board          0.33**  0.09*   0.22**  1
 5 Teaching             0.38**  0.22**  0.27**  0.38**  1
 6 Book                 0.36**  0.11*   0.23**  0.41**  0.43**  1
 7 TLM                  0.13**  -0.06   0.05    0.20**  0.24**  0.23**  1
 8 Friend               0.32**  0.30**  0.34**  0.21**  0.29**  0.31**  0.24**  1
 9 Games                0.08*   0.09*   0.11*   0.14**  0.24**  0.21**  0.33**  0.31**  1
10 Cultural programme   0.11*   0.05    0.07    0.14**  0.34**  0.21**  0.24**  0.27**  0.30**  1
11 Book bank            0       0.17**  0.20**  -0.02   -0.01   -0.04   0.07    0.18**  0.15**  0.05    1
13 Health checkup       0.33**  0.21**  0.29**  0.20**  0.22**  0.11*   0.01    0.25**  0.04    -0.03   0.08*   0.19**  1

Significant correlations suggest a possible latent structure of relationship, signifying latent meaning.

Extraction of Factors using PCA

Infrastructures       Basic            Supportive       Activity based
                      Infrastructure   Infrastructure   Infrastructure
Class room            0.62             0.35             -0.03
Drinking Water        0.06             0.81             -0.03
Toilet                0.2              0.79             0.03
Black board           0.7              0.06             0.1
Teaching              0.63             0.19             0.29
Book                  0.68             0.07             0.25
TLM                   0.21             -0.09            0.63
Friend                0.24             0.49             0.44
Games                 0.07             0.11             0.74
Cultural Programme    0.19             -0.01            0.63
Book bank             -0.35            0.47             0.36
Mid-day meal          0.49             -0.02            0.26
Health checkup        0.38             0.46             -0.17
Eigen Value           3.5              1.67             1.34
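One way to read the loading table is to assign each infrastructure to the component on which it loads most strongly. The sketch below does this in Python with the loadings copied from the table above; using the largest absolute loading is a common convention assumed here for illustration, and the slides do not state the rule the author used.

```python
# Loadings copied from the table above, in the order
# (Basic, Supportive, Activity based).
COMPONENTS = ("Basic", "Supportive", "Activity based")
LOADINGS = {
    "Class room": (0.62, 0.35, -0.03),
    "Drinking Water": (0.06, 0.81, -0.03),
    "Toilet": (0.2, 0.79, 0.03),
    "Black board": (0.7, 0.06, 0.1),
    "Teaching": (0.63, 0.19, 0.29),
    "Book": (0.68, 0.07, 0.25),
    "TLM": (0.21, -0.09, 0.63),
    "Friend": (0.24, 0.49, 0.44),
    "Games": (0.07, 0.11, 0.74),
    "Cultural Programme": (0.19, -0.01, 0.63),
    "Book bank": (-0.35, 0.47, 0.36),
    "Mid-day meal": (0.49, -0.02, 0.26),
    "Health checkup": (0.38, 0.46, -0.17),
}

def dominant_component(loadings):
    """Name of the component with the largest absolute loading (assumed rule)."""
    index = max(range(len(loadings)), key=lambda k: abs(loadings[k]))
    return COMPONENTS[index]

assignment = {name: dominant_component(vals) for name, vals in LOADINGS.items()}
for name, component in assignment.items():
    print(f"{name:20s} -> {component}")
```

Variables such as Friend and Book bank load on more than one component at a similar level, so their assignment under this rule should be read with caution.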

Plot of Eigenvalues

Limitation of PCA

• PCA is applicable when variables are measured on interval or ratio scales.

• When variables are measured on a nominal (categorical) scale, correspondence analysis is a useful statistical tool.

Correspondence Analysis

• Correspondence analysis is an exploratory multivariate technique that converts frequency table data into graphical displays in which rows and columns are depicted as points. It provides a method for comparing row or column proportions in a two-way or multiway table. CA investigates the magnitude and the substantive nature of the association between the row and column categories of a cross-tabulation, rather than confirming or rejecting hypotheses about the underlying process.

• These methods were originally developed in France by Jean-Paul Benzécri in the 1960s and 1970s, and gained wider attention through the classic text by Greenacre (1984).

• Other names: correspondence mapping, perceptual mapping, social space analysis, correspondence factor analysis, principal components analysis of qualitative data, and dual scaling.

• Types: Simple and Multiple.
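The conversion of a frequency table into row and column points can be sketched for the simple (two-way) case. The example below is a minimal illustration, assuming Python with NumPy and the usual formulation through the singular value decomposition of standardized residuals; the function name `correspondence_analysis` and the frequency table are hypothetical and are not data from the slides.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple correspondence analysis of a two-way frequency table.

    Returns principal coordinates for rows and columns and the
    principal inertias (squared singular values).
    """
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)    # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sv ** 2

# Hypothetical cross-tabulation: 4 tasks (rows) by 3 rating categories (columns).
table = [
    [20, 10, 5],
    [5, 15, 10],
    [10, 10, 10],
    [2, 8, 25],
]
row_coords, col_coords, inertia = correspondence_analysis(table)
print(np.round(row_coords[:, :2], 3))   # first two dimensions give the map of rows
print(np.round(col_coords[:, :2], 3))   # first two dimensions give the map of columns
print(np.round(inertia, 4))
```

The total inertia equals the chi-square statistic of the table divided by the grand total, which links the map back to the strength of association between rows and columns.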

Study on CA

• Purpose: To determine correspondence between computer programming tasks and relative use.

• Assumptions: Uses of 14 computer programming tasks vary.

• Data Sets: 14 programming tasks (row variables) and 5 ratings of use (column variables).

INPUT TABLE FOR CA

Cross Tabulation of 5 Cols. (Rating categories) X 14 Rows (Computer programming tasks).

Assumption: Some tasks are related to each other; some are used more frequently and some less frequently.

Correspondence Map of 14 Row and 5 Col. variables

Cluster Analysis

• Cluster analysis helps to identify similar entities on the basis of the characteristics they possess. It helps to classify objects or variables having functional homogeneity. The resulting object clusters should exhibit high internal homogeneity (within a cluster) and high external heterogeneity (between any two clusters). It is an inductive treatment and a purely empirical method of classification.

Tree diagram based on cluster analysis

[Figure: Tree Diagram for students of all communities in Tripura based on Reading Motive. Complete linkage, Euclidean distances. Axis: Linkage Distance, 0.0 to 3.5. Groups as listed in the plot: OTH_TR, HALAM, JAMATIA, REANG, NON_TR, CHAKMA, TRIPURI.]
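The tree diagram is built with complete linkage on Euclidean distances. The merging procedure can be sketched as below. This is a minimal pure-Python illustration; the function name `complete_linkage`, the group labels and the profile values are hypothetical and are not the Tripura reading motive data.

```python
import math

def complete_linkage(points, labels):
    """Agglomerative clustering with complete linkage on Euclidean distances.

    Returns the merge history as (cluster_a, cluster_b, linkage_distance) tuples.
    """
    clusters = {label: [point] for label, point in zip(labels, points)}
    merges = []
    while len(clusters) > 1:
        names = list(clusters)
        best = None
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                # Complete linkage: distance between the two farthest members.
                d = max(math.dist(p, q)
                        for p in clusters[names[i]]
                        for q in clusters[names[j]])
                if best is None or d < best[0]:
                    best = (d, names[i], names[j])
        d, a, b = best
        # Merge the closest pair and record the height at which they join.
        clusters[f"({a}+{b})"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges

# Hypothetical group profiles (two motive scores per group).
labels = ["G1", "G2", "G3", "G4"]
profiles = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 7.0)]
merges = complete_linkage(profiles, labels)
for a, b, d in merges:
    print(f"{a} + {b} at linkage distance {d:.2f}")
```

Each recorded linkage distance corresponds to the height at which two branches join in a tree diagram, so groups that merge at small distances have similar profiles.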

MANOVA

• MANOVA is a tool to determine significant differences among groups in a set of correlated dependent variables.

Variables   Wilks’ Lambda   Rao’s R   df        P-value
Gender      0.97            1.76      9,558     NS
Religion    0.92            5.06      9,532     0
S-E-S       0.9             3.03      9,242     0
District    0.35            14.44     452,499   0
Blocks      0.9             6.49      9,562     0
School      0.77            5.13      9,153     0
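The Wilks' Lambda column can be understood as the ratio of within-group to total generalized variance: values near 1 indicate little group difference, and smaller values indicate stronger separation. The sketch below is a minimal illustration, assuming Python with NumPy and a one-way design; the function name `wilks_lambda` and the simulated groups are hypothetical and do not reproduce the values in the table.

```python
import numpy as np

def wilks_lambda(groups):
    """Wilks' Lambda = det(W) / det(T) for a one-way MANOVA.

    groups: list of (n_i x p) arrays, one per group, with the same p
    dependent variables in every group.
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_data = np.vstack(groups)
    centered_total = all_data - all_data.mean(axis=0)
    # Within-group sums of squares and cross-products (SSCP).
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    # Total SSCP around the grand mean.
    T = centered_total.T @ centered_total
    return float(np.linalg.det(W) / np.linalg.det(T))

# Simulated data (assumption): two dependent variables, 50 cases per group.
rng = np.random.default_rng(1)
group_a = rng.normal(size=(50, 2))
group_b_same = rng.normal(size=(50, 2))          # same population as group_a
group_b_shifted = rng.normal(size=(50, 2)) + 3   # clearly different means

lam_same = wilks_lambda([group_a, group_b_same])
lam_shifted = wilks_lambda([group_a, group_b_shifted])
print(round(lam_same, 3), round(lam_shifted, 3))
```

A significance test then converts Lambda into an approximate F statistic (reported as Rao's R in the table); that step is left out of this sketch.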

Fisher’s Linear Discriminant Functions for differentiating Schools with Good and Poor Infrastructure

Attitudinal Dimensions   Good Infrastructure   Poor Infrastructure
Cleanliness              5.05                  -0.46
Safety                   -1.27                 -0.45
Comfort                  4.93                  3.76
Reliability              6.2                   4.51
Equal Opportunity        2.58                  2.03
Constant                 -26.27                -15.57

Eigen Values   Canonical Correlation   Wilks’ Lambda   Chi-Square   Df   P-Value
0.9            0.687                   0.53            101.41       5    0
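Coefficients of this kind are usually applied by computing one score per group and assigning the case to the group with the higher score. The sketch below illustrates that reading in Python with the coefficients copied from the table above; the respondent ratings and their scale are hypothetical, since the slides do not report the scoring range of the attitudinal dimensions.

```python
# Coefficients copied from the table above.
COEFFICIENTS = {
    "Good Infrastructure": {
        "Cleanliness": 5.05, "Safety": -1.27, "Comfort": 4.93,
        "Reliability": 6.2, "Equal Opportunity": 2.58, "Constant": -26.27,
    },
    "Poor Infrastructure": {
        "Cleanliness": -0.46, "Safety": -0.45, "Comfort": 3.76,
        "Reliability": 4.51, "Equal Opportunity": 2.03, "Constant": -15.57,
    },
}

def classification_scores(ratings):
    """Score of one case on each group's linear function."""
    scores = {}
    for group, coef in COEFFICIENTS.items():
        scores[group] = coef["Constant"] + sum(
            coef[dimension] * value for dimension, value in ratings.items()
        )
    return scores

def classify(ratings):
    """Assign the case to the group with the larger score."""
    scores = classification_scores(ratings)
    return max(scores, key=scores.get)

# Hypothetical ratings on the five dimensions (scale assumed for illustration).
case_a = {"Cleanliness": 2.0, "Safety": 1.0, "Comfort": 1.0,
          "Reliability": 1.0, "Equal Opportunity": 1.0}
case_b = dict(case_a, Cleanliness=0.5)

print(classification_scores(case_a), classify(case_a))
print(classification_scores(case_b), classify(case_b))
```

In this illustration only the Cleanliness rating differs between the two cases, which reflects the large gap between the two Cleanliness coefficients in the table.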

Classification Matrix of Good and Poor schools in terms of infrastructure availability

Original Group        Predicted Group        Predicted Group        Total
                      Good Infrastructure    Poor Infrastructure
Count        Good     75                     10                     85
             Poor     18                     60                     78
             Total    93                     70                     163
Percentage   Good     88.2                   11.8                   100
             Poor     23.1                   76.9                   100

Correct Classification Percentage= (75+60)/163 x 100=82.8
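The percentages in the classification matrix and the overall figure above follow directly from the counts. The short Python sketch below recomputes them from the counts in the table; the variable and function names are illustrative.

```python
# Counts copied from the classification matrix: confusion[original][predicted].
confusion = {
    "Good": {"Good": 75, "Poor": 10},
    "Poor": {"Good": 18, "Poor": 60},
}

def overall_accuracy(matrix):
    """Percentage of all cases whose predicted group matches the original group."""
    correct = sum(matrix[group][group] for group in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return 100 * correct / total

def per_group_accuracy(matrix):
    """Percentage of each original group that was classified correctly."""
    return {group: 100 * row[group] / sum(row.values())
            for group, row in matrix.items()}

overall = overall_accuracy(confusion)
by_group = per_group_accuracy(confusion)
print(round(overall, 1))
print({group: round(value, 1) for group, value in by_group.items()})
```

The per-group figures show that good-infrastructure schools are classified more accurately than poor-infrastructure schools, which the single overall percentage hides.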

Box-plot Analysis of Discriminant Scores between Good and Poor Infrastructure Schools.

[Figure: box plots of Discriminant Scores from Function 1 for Analysis 1 (vertical axis, -5 to 4) by Predicted Group for Analysis 1 (horizontal axis, group codes 3.00 and 4.00; N = 93 and 70).]

Some of my studies on MVS

• Dutta Roy, D. (2007). Taxonomic approach in job analysis. In S. Subramony and S.B. Raj (Eds.), Psychological Assessment in Personnel Selection. Delhi: Defense Institute of Psychological Research, pp. 25-39.

• Dutta Roy, D. (2006). Clustering academic profiles of tribal and non-tribal school students of Manipur. Journal of Psychometry, 20(2), 1-12.

• Dutta Roy, D. (2002). Personality differences across four metropolitan cities of India. Indian Psychological Review, 58(2), 71-78. http://www.isical.ac.in/~ddroy/abstract.html#Personality%20differences%20across%20four%20metropolitan

• Dutta Roy, D. and Bannerjee, I. (1998). Correspondence analysis between stimulus length and amount of forgetting in assessment of short term memory span. Indian Journal of Psychometry and Education, 29(1), 7-12. http://www.isical.ac.in/~ddroy/abstract.html#Correspondence%20analysis%20between%20stimulus%20length



Thank You