Click here to load reader

Multivariate Statistical Analysis Using Statgraphics ... · PDF fileMultivariate Data Analysis Using Statgraphics Centurion: Part 2 Dr. Neil W. Polhemus Statpoint Technologies, Inc

  • View
    240

  • Download
    2

Embed Size (px)

Text of Multivariate Statistical Analysis Using Statgraphics ... · PDF fileMultivariate Data Analysis...

  • Multivariate Data Analysis Using

    Statgraphics Centurion: Part 2

    Dr. Neil W. Polhemus

    Statpoint Technologies, Inc.

    1

  • Multivariate Statistical Methods

    The simultaneous observation and analysis of more than one response variable.

    *Primary Uses

    1. Data reduction or structural simplification

    2. Sorting and grouping

    3. Investigation of the dependence among variables

    4. Prediction

    5. Hypothesis construction and testing

    *Johnson and Wichern, Applied Multivariate Statistical Analysis

    2

  • Data set: U.S. Demographics

    3

  • Variables

    Education: percent completing college

    Income per Capita: in 2011 dollars

    Urban Percentage: percent living in urban areas

    Population Density: people per square mile

    Non-English Language: percent that dont speak English

    Median Age: in years

    Unemployment Rate: for 2012

    Percent Female: of general population

    Crime Rate: violent crimes per 100,000

    Manufacturing: % of GDP

    Infant Mortality: 2006

    Poverty Rate: percentage below poverty level

    Election 2012: results of 2012 presidential election (Obama or Romney)

    4

  • Urban Percentage

    5

  • Percent Female

    6

  • Poverty Rate

    7

  • Star Glyphs

    8

  • Data Input

    9

  • 10

  • Cluster Analysis

    Method for assigning objects to groups so that objects in the same group are more similar to each other than they are to objects in other groups.

    Some questions:

    1. How many natural groups are there?

    2. Which objects belong to each group?

    11

  • Data Input

    12

  • Analysis Options

    13

  • Dendrogram Nearest Neighbor

    14

  • Dendrogram Furthest Neighbor

    15

  • Agglomeration Distance Plot

    16

  • Dendrogram - 4 Clusters

    17

  • Save Cluster Numbers

    18

  • Map of Cluster Numbers

    19

  • Method of k-Means

    1. Starts with a set of typical objects (seeds) for each cluster.

    2. Each object is assigned to group with nearest seed.

    3. Centroids of groups are computed.

    4. Objects are assigned to different groups if nearer to a different centroid.

    5. Steps 3 and 4 are repeated until no change.

    20

  • Method of k-Means

    21

  • Seeds

    22

    Row 5 = California Row 32 = New York Row 34 = North Dakota Row 43 = Texas

  • Map for k-Means

    23

  • Classification

    The classification problem deals with the assignment of objects to known groups. Goals include:

    1. Understanding what differentiates the groups.

    2. Predicting group membership for unclassified objects.

    24

  • Methods

    Discriminant Analysis (on the Centurion Relate menu) based on Fishers linear discriminant functions.

    Neural Network Classifier (on the Centurion Relate menu) based on a Bayesian classifier with priors and costs.

    Uniwin (companion program to Statgraphics) includes methods for dealing with qualitative factors.

    25

  • (Linear) Discriminant Analysis

    Constructs linear functions of the standardized classification variables:

    These functions are derived so as to maximize the separation of the groups. If there are g groups, there are g-1 discriminant functions.

    26

    pjpjjj ZdZdZdD ...2211

  • Example

    27

  • Data Input

    28

  • Analysis Options

    29

  • All Variables

    30

  • Discriminant Function Plot

    31

  • Backward selection

    32

  • Classification Functions

    These functions create a score for each group:

    Objects are assigned to whatever group has the largest value of (Cj * priorj) where priorj is the prior probability of belonging to group j.

    33

    02211 ... jpjpjjj cXcXcXcC

  • Classification Functions

    34

  • Classification Summary

    35

  • Classification Table

    36

  • Validation Set

    May leave some of the objects out of the training set and use them to validate the results.

    37

    Correctly classified 8 out of 10 omitted states.

  • Neural Network Classifier

    Implements a nonparametric method for classifying objects based on the product of 3 quantities:

    1. The estimated density function in the neighborhood of the object

    (given a specified value of s).

    2. The prior probabilities of belonging to each group.

    3. The costs of misclassifying cases that belong to a given group.

    38

  • Bivariate Density (s = 0.5)

    39

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0123

    45

    (X 0.001)

    de

    ns

    ity

    Romney

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0123

    45

    (X 0.001)

    de

    ns

    ity

    Romney

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0123

    45

    (X 0.001)

    de

    ns

    ity

    Romney

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0123

    45

    (X 0.001)

    de

    ns

    ity

    Romney

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0123

    45

    (X 0.001)

    de

    ns

    ity

    Romney

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0123

    45

    (X 0.001)

    de

    ns

    ity

    Romney

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0123

    45

    (X 0.001)

    de

    ns

    ity

    RomneyObama

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0

    1

    2

    3

    4(X 0.001)

    de

    ns

    ity

    Obama

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0

    1

    2

    3

    4(X 0.001)

    de

    ns

    ity

    Obama

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0

    1

    2

    3

    4(X 0.001)

    de

    ns

    ity

    Obama

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0

    1

    2

    3

    4(X 0.001)

    de

    ns

    ity

    Obama

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0

    1

    2

    3

    4(X 0.001)

    de

    ns

    ity

    Obama

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0

    1

    2

    3

    4(X 0.001)

    de

    ns

    ity

    Obama

    Biv ariate Density

    3555

    7595

    115Urban Percentage

    8

    12

    16

    20

    24

    Pov erty rate0

    1

    2

    3

    4(X 0.001)

    de

    ns

    ity

  • Schematic Diagram

    40

  • Data Input

    41

  • Analysis Options

    42

  • Classification Summary

    43

  • Reduced Set of Variables

    44

  • Some Improvement

    45

  • Classification Plot

    46

  • Classification Plot

    47

  • UNIWIN Plus from Sigma Plus

    Software package written by our longtime colleague Christian Charles at Sigma Plus in France.

    Contains additional features for multivariate analysis.

    Reads Statgraphics data files.

    48

  • Uniwin Cluster and Classification

    Handle qualitative variables.

    49

  • New Variables

    50

  • Uniwin Cluster Analysis

    51

  • Dendrogram

    52

  • Uniwin Clusters

    53

  • Uniwin Classification

    54

  • Classification Results

    55

    Wrongly classified only Nevada (only cattle state to vote for Obama).

  • More Information

    56

    Statgraphics Centurion: www.statgraphics.com Uniwin Plus: www.statgraphics.fr or www.sigmaplus.fr Or send e-mail to [email protected]

    http://www.statgraphics.com/http://www.statgraphics.fr/http://www.sigmaplus.fr/mailto:[email protected]

  • Join the Statgraphics Community on:

    Follow us on

    https://www.facebook.com/pages/STATGRAPHICS/79706768276http://www.linkedin.com/groups?gid=1796550https://twitter.com/STATGRAPHICS