Descriptive Analytics: Data Reduction

Page 1: Descriptive Analytics: Data Reduction

DESCRIPTIVE ANALYTICS: DATA REDUCTION

Page 2: Descriptive Analytics: Data Reduction

Data reduction - breaking down large sets of data into more-manageable groups or segments that provide better insight.

◦ Data sampling

◦ Data cleaning

◦ Data transformation

◦ Data segmentation

◦ Dimension reduction

2

Page 3: Descriptive Analytics: Data Reduction

DATA SAMPLING

Page 4: Descriptive Analytics: Data Reduction

Data sampling - extract a sample of data that is relevant to the business problem under consideration.

◦ A population includes all of the entities of interest in a study.

◦ A sample is a subset of the population, often randomly chosen and preferably representative of the population as a whole.

Statistical inference focuses on drawing conclusions about populations from samples.

◦ Estimation of population parameters

◦ Hypothesis testing – involves drawing conclusions about the value of the parameters of one or more populations based on sample data.

4

Page 5: Descriptive Analytics: Data Reduction

Sampling plan - a description of the approach that is used to obtain samples from a population prior to any data collection activity.

A sampling plan states:

◦ its objectives

◦ target population

◦ population frame (the list from which the sample is selected)

◦ operational procedures for collecting data

◦ statistical tools for data analysis

5

Page 6: Descriptive Analytics: Data Reduction

Example: A company wants to understand how golfers might respond to a membership program that provides discounts at golf courses.

◦ Objective - estimate the proportion of golfers who would join the program

◦ Target population - golfers over 25 years old

◦ Population frame - golfers who purchased equipment at particular stores

◦ Operational procedures - e-mail link to survey or direct-mail questionnaire

◦ Statistical tools - PivotTables to summarize data by demographic groups and estimate likelihood of joining the program

6

Page 7: Descriptive Analytics: Data Reduction

Subjective sampling methods

◦ Judgment sampling – expert judgment is used to select the sample

◦ Convenience sampling – samples are selected based on the ease with which the data can be collected

Probabilistic sampling methods

◦ Simple random sampling – involves selecting items from a population so that every subset of a given size has an equal chance of being selected

◦ Systematic (periodic) sampling – a sampling plan that selects every nth item from the population.

◦ Stratified sampling – applies to populations that are divided into natural subsets (called strata) and allocates the appropriate proportion of samples to each stratum.

◦ ...

7
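The slides demonstrate these methods in Excel and XLMiner; as an illustrative sketch, the two simplest probabilistic methods can also be expressed in Python (the 100-item population here is made up for the example):

```python
import random

def simple_random_sample(population, n, seed=1):
    """Every subset of size n has an equal chance of being selected."""
    random.seed(seed)
    return random.sample(population, n)

def systematic_sample(population, step, start=0):
    """Systematic (periodic) sampling: select every step-th item."""
    return population[start::step]

data = list(range(1, 101))          # hypothetical population of 100 items
print(simple_random_sample(data, 5))
print(systematic_sample(data, 25))  # items 1, 26, 51, 76
```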

Page 8: Descriptive Analytics: Data Reduction

We can determine the appropriate sample size needed to estimate the population parameter within a specified level of precision (± E).

Sample size for the mean: n ≥ (z_{α/2})² σ² / E²

Sample size for the proportion: n ≥ (z_{α/2})² π(1 − π) / E²
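Assuming the standard formulas above (z is the normal critical value for the chosen confidence level, rounded up to the next integer), the computation can be sketched in Python; the σ, E, and p values are made-up examples:

```python
import math
from statistics import NormalDist

def sample_size_mean(sigma, E, conf=0.95):
    """n >= (z * sigma / E)^2, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil((z * sigma / E) ** 2)

def sample_size_proportion(p, E, conf=0.95):
    """n >= z^2 * p * (1 - p) / E^2, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / E ** 2)

print(sample_size_mean(sigma=10, E=2))        # 97 at 95% confidence
print(sample_size_proportion(p=0.5, E=0.03))  # 1068 at 95% confidence
```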

8

Page 9: Descriptive Analytics: Data Reduction

Using Analysis ToolPak Add-in

Data > Analysis > Data Analysis > Sampling

9

Page 10: Descriptive Analytics: Data Reduction

Sales Transactions database

Data > Data Analysis > Sampling

◦ Periodic – selects every nth number

◦ Random – selects a simple random sample

Sampling is done with replacement, so duplicates may occur.

10

Page 11: Descriptive Analytics: Data Reduction

XLMiner can sample from an Excel worksheet

XLMiner > Data > Get Data > Worksheet

11

Page 12: Descriptive Analytics: Data Reduction

Credit Risk Data

Click inside the database

XLMiner > Get Data > Worksheet

Select variables and move to right pane

Choose sampling options

12

Page 13: Descriptive Analytics: Data Reduction

Results

13

Page 14: Descriptive Analytics: Data Reduction

Using sample data may limit our ability to predict uncertain events that may occur because potential values outside the range of the sample data are not included.

A better approach is to identify the underlying probability distribution from which sample data come by “fitting” a theoretical distribution to the data and verifying the goodness of fit statistically.

◦ Examine a histogram for clues about the distribution’s shape

◦ Look at summary statistics such as the mean, median, standard deviation, coefficient of variation, and skewness

14

Page 15: Descriptive Analytics: Data Reduction

A random variable is a numerical description of the outcome of an experiment.

◦ A discrete random variable is one for which the number of possible outcomes can be counted.

◦ A continuous random variable has outcomes over one or more continuous intervals of real numbers.

A probability distribution is a characterization of the possible values that a random variable may assume along with the probability of assuming these values.

15

Page 16: Descriptive Analytics: Data Reduction

We may develop a probability distribution using any one of the three perspectives of probability:

Classical: probabilities can be deduced from theoretical arguments

Subjective: probabilities are based on judgment and experience (This is often done in creating decision models for phenomena for which we have no historical data)

Relative frequency (empirical): probabilities are based on the relative frequencies from a sample of empirical data

16

Page 17: Descriptive Analytics: Data Reduction

Roll 2 dice: 36 possible rolls – (1,1), (1,2), … (6,5), (6,6)

Probability = number of ways of rolling a number divided by 36; e.g., probability of a 3 is 2/36

Suppose two consumers try a new product. Four outcomes:

1. like, like
2. like, dislike
3. dislike, like
4. dislike, dislike

Probability at least one dislikes product = 3/4
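Both classical probabilities above can be checked by enumerating the equally likely outcomes, for example in Python:

```python
from itertools import product
from fractions import Fraction

# All 36 equally likely rolls of two dice
rolls = list(product(range(1, 7), repeat=2))
p_sum_3 = Fraction(sum(1 for a, b in rolls if a + b == 3), len(rolls))
print(p_sum_3)  # 1/18, i.e. 2/36

# Two consumers: 4 equally likely outcomes; "at least one dislikes" covers 3
outcomes = list(product(["like", "dislike"], repeat=2))
p_dislike = Fraction(sum(1 for o in outcomes if "dislike" in o), len(outcomes))
print(p_dislike)  # 3/4
```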

17

Page 18: Descriptive Analytics: Data Reduction

Distribution of an expert’s assessment of how the DJIA (Dow Jones Industrial Average) might change next year.

18

Page 19: Descriptive Analytics: Data Reduction

Airline Passengers

Sample data on passenger demand for 25 flights

◦ The histogram shows a relatively symmetric distribution. The mean, median, and mode are all similar, although there is moderate skewness. A normal distribution is not unreasonable.

19

Page 20: Descriptive Analytics: Data Reduction

Airport Service Times

Sample data on service times for 812 passengers at an airport’s ticketing counter

◦ It is not clear what the distribution might be. It does not appear to be exponential, but it might be lognormal or another distribution.

20

Page 21: Descriptive Analytics: Data Reduction

A better approach than simply visually examining a histogram and summary statistics is to analytically fit the data to the best type of probability distribution.

Several statistics measure goodness of fit:

◦ AIC/BIC (Akaike information criterion / Bayesian information criterion)

◦ Chi-square (need at least 50 data points)

◦ Kolmogorov-Smirnov (works well for small samples)

◦ Anderson-Darling (puts more weight on the differences between the tails of the distributions)

Analytic Solver Platform has the capability of fitting a probability distribution to data.

21

Page 22: Descriptive Analytics: Data Reduction

1. Highlight the data

Analytic Solver Platform> Tools > Fit

2. Fit Options dialog
Type: Continuous
Test: Kolmogorov-Smirnov
Click Fit button

22

Page 23: Descriptive Analytics: Data Reduction

The best-fitting distribution is called an Erlang distribution.

23

Page 24: Descriptive Analytics: Data Reduction

A random number is one that is uniformly distributed between 0 and 1.

Excel function: =RAND( )

A value randomly generated from a specified probability distribution is called a random variate.

◦ Example: Uniform distribution

24

Page 25: Descriptive Analytics: Data Reduction

Analysis Toolpak Random Number Generation Tool

◦ Can sample from uniform, normal, Bernoulli, binomial, Poisson, patterned, and discrete distributions.

◦ Can also specify a random number seed – a value from which a stream of random numbers is generated. By specifying the same seed, you can produce the same random numbers at a later time.

25

Page 26: Descriptive Analytics: Data Reduction

Generate 100 outcomes from a Poisson distribution with a mean of 12

◦ Number of Variables = 1

◦ Number of Random Numbers = 100

◦ Distribution = Poisson

◦ Dialog changes and prompts you to enter Lambda (mean of Poisson) = 12

26

Page 27: Descriptive Analytics: Data Reduction

Results

(Histogram created manually)

27

Page 28: Descriptive Analytics: Data Reduction

Normal: =NORM.INV(RAND( ), mean, stdev)

Standard normal: =NORM.S.INV(RAND( ))
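These Excel formulas apply the inverse-transform method; a rough Python equivalent using the standard library's NormalDist is sketched below (the mean 100 and stdev 15 are made-up values):

```python
import random
from statistics import NormalDist

def normal_variate(mean, stdev, rng=random):
    """Inverse-transform sampling: push a uniform random number through
    the normal inverse CDF -- the analogue of =NORM.INV(RAND(), mean, stdev)."""
    return NormalDist(mean, stdev).inv_cdf(rng.random())

random.seed(42)
samples = [normal_variate(100, 15) for _ in range(1000)]
print(round(sum(samples) / len(samples), 1))  # sample mean near 100
```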

28

Page 29: Descriptive Analytics: Data Reduction

In finance, one way of evaluating capital budgeting projects is to compute a profitability index: PI = PV / I, where

PV is the present value of future cash flows

I is the initial investment

What is the probability distribution of PI when PV is estimated to be normally distributed with a mean of $12 million and a standard deviation of $2.5 million, and the initial investment is also estimated to be normal with a mean of $3.0 million and standard deviation of $0.8 million?
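A quick Monte Carlo sketch of this question in Python (the seed and the 10,000 trials are arbitrary simulation choices, not from the slides):

```python
import random
from statistics import median

random.seed(1)
# PV ~ N(12, 2.5) and I ~ N(3, 0.8), in $ millions (parameters from the slide)
pis = [random.gauss(12, 2.5) / random.gauss(3, 0.8) for _ in range(10_000)]
print("median PI:", round(median(pis), 2))             # near 12/3 = 4
print("P(PI > 1):", sum(p > 1 for p in pis) / len(pis))
```

The median is used here rather than the mean because a few simulated I values can fall near zero, producing extreme PI ratios.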

29

Page 30: Descriptive Analytics: Data Reduction

Column F:=NORM.INV(RAND(), 12, 2.5)

Column G: =NORM.INV(RAND(), 3, 0.8)

30

Page 31: Descriptive Analytics: Data Reduction

Analytic Solver Platform provides Excel functions to generate random variates for many distributions

31

Page 32: Descriptive Analytics: Data Reduction

An energy company was considering offering a new product and needed to estimate the growth in PC ownership.

Using the best data and information available, they determined that the minimum growth rate was 5.0%, the most likely value was 7.7%, and the maximum value was 10.0% (a triangular distribution).

◦ A portion of 500 samples that were generated using the function PsiTriangular(5%, 7.7%, 10%):
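A comparable sketch using Python's standard library rather than PsiTriangular (note the different argument order):

```python
import random

random.seed(7)
# Growth-rate distribution from the slide: min 5%, most likely 7.7%, max 10%.
# Python's argument order is (low, high, mode), unlike PsiTriangular(min, likely, max).
rates = [random.triangular(0.05, 0.10, 0.077) for _ in range(500)]
print(round(min(rates), 3), round(max(rates), 3))  # both within [0.05, 0.10]
print(round(sum(rates) / len(rates), 3))           # near (0.05 + 0.077 + 0.10) / 3
```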

32

Page 33: Descriptive Analytics: Data Reduction

DATA CLEANING

Page 34: Descriptive Analytics: Data Reduction

Real data sets often have missing values or errors. Such data sets are called “dirty” and need to be “cleaned” prior to analyzing them.

◦ Handling missing data

◦ Handling outliers (observations that are radically different from the rest)

34

Page 35: Descriptive Analytics: Data Reduction

Approaches for handling missing data:

◦ Eliminate the records/variables that contain missing data

◦ Estimate reasonable values for missing observations, such as the mean or median value

◦ Use a data mining procedure to deal with them.

XLMiner has the capability to deal with missing data in the Transform menu in the Data Analysis group.
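Outside of XLMiner, the mean/median replacement strategy can be sketched in a few lines of Python; None marks a missing value in this made-up example:

```python
from statistics import mean, median

def impute(values, how="mean"):
    """Replace missing observations (None) with the mean or median of the
    observed values -- one of the simple remedies listed above."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if how == "mean" else median(observed)
    return [fill if v is None else v for v in values]

print(impute([1, None, 3]))               # [1, 2, 3]
print(impute([1, None, 3, 8], "median"))  # fills with the median, 3
```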

35

Page 36: Descriptive Analytics: Data Reduction

XLMiner's Missing Data Handling utility allows users to detect missing values in the dataset and handle them in a specified way. XLMiner considers an observation to be missing data if the cell is empty or contains an invalid formula. In addition, it is also possible to treat cells containing specific data as “missing”.

XLMiner offers several different methods for remedying missing or invalid values. Each variable can be assigned a different “treatment”. For example, the entire record could be deleted if there is a missing value for one variable, while the missing value could be replaced with a specific value for another variable.

36

Page 37: Descriptive Analytics: Data Reduction

37

Page 38: Descriptive Analytics: Data Reduction

Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools can uncover data quality issues and outliers.

Some typical rules of thumb:

◦ z-scores greater than +3 or less than -3

◦ Extreme outliers are more than 3*IQR to the left of Q1 or right of Q3

◦ Mild outliers are between 1.5*IQR and 3*IQR to the left of Q1 or right of Q3

Note:

* A standardized value, commonly called a z-score, provides a relative measure of the distance an observation is from the mean, which is independent of the units of measurement.

* The interquartile range (IQR), or midspread, is the difference between the first and third quartiles, Q3 – Q1.

38
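These rules of thumb can be sketched in Python; the data set is invented and contains one planted outlier:

```python
from statistics import mean, stdev, quantiles

data = list(range(20, 39)) + [150]  # made-up data with one planted outlier

# z-score rule: flag observations more than 3 standard deviations from the mean
m, s = mean(data), stdev(data)
z_outliers = [x for x in data if abs((x - m) / s) > 3]

# IQR rules: mild if beyond 1.5*IQR, extreme if beyond 3*IQR from the quartiles
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
mild = [x for x in data
        if q3 + 1.5 * iqr < x <= q3 + 3 * iqr
        or q1 - 3 * iqr <= x < q1 - 1.5 * iqr]
extreme = [x for x in data if x > q3 + 3 * iqr or x < q1 - 3 * iqr]
print(z_outliers, mild, extreme)  # [150] [] [150]
```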

Page 39: Descriptive Analytics: Data Reduction

Home Market Value data

None of the z-scores exceed 3. However, while individual variables might not exhibit outliers, combinations of them might.

◦ The last observation has a high market value ($120,700) but a relatively small house size (1,581 square feet) and may be an outlier.

39

Page 40: Descriptive Analytics: Data Reduction

Closer examination of outliers may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis.

A conservative approach is to create two data sets, one with and one without outliers, and then construct a model on both data sets.

◦ If a model’s implications depend on the inclusion or exclusion of outliers, then one should spend additional time to track down the cause of the outliers.

40

Page 41: Descriptive Analytics: Data Reduction

DATA TRANSFORMATION

Page 42: Descriptive Analytics: Data Reduction

Often data sets contain variables that, considered separately, are not particularly insightful but that, when combined as ratios, may represent important relationships.

◦ Example: the price/earnings (PE) ratio

A critical task is determining how to represent the measurements of the variables and which variables to consider.

◦ Example: The variable Language with the possible values of English, German, and Spanish would be replaced with three binary variables called English, German, and Spanish.
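The Language example might be coded as follows (the function and variable names are made up for illustration):

```python
def to_dummies(values, categories):
    """Replace a categorical variable with one binary (0/1) column per
    category, as in the Language example above."""
    return [{c: int(v == c) for c in categories} for v in values]

langs = ["English", "German", "Spanish", "German"]
dummies = to_dummies(langs, ["English", "German", "Spanish"])
print(dummies[1])  # {'English': 0, 'German': 1, 'Spanish': 0}
```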

42

Page 43: Descriptive Analytics: Data Reduction

v ∈ [min, max]

v′ = [(v − min) / (max − min)] × (max_new − min_new) + min_new

v′ ∈ [min_new, max_new]
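The min-max transformation above translates directly into code:

```python
def rescale(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Min-max transformation: map v from [lo, hi] onto [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

print(rescale(5, 0, 10))           # 0.5
print(rescale(96, 96, 252, 0, 4))  # 0.0
```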

43

Page 44: Descriptive Analytics: Data Reduction

XLMiner provides a Transform Categorical procedure under Transform in the Data Analysis group.

This procedure provides options to create dummy variables, create ordinal category scores, and reduce categories by combining them into similar groups.

44

Page 45: Descriptive Analytics: Data Reduction

45

Page 46: Descriptive Analytics: Data Reduction

46

Page 47: Descriptive Analytics: Data Reduction

47

Page 48: Descriptive Analytics: Data Reduction

In some cases, it may be desirable to transform a continuous variable into categories. XLMiner provides a Bin Continuous Data procedure under Transform in the Data Analysis group.

Caution: In general, transforming continuous variables into categories causes a loss of information (a continuous variable’s category is less informative than a specific numeric value) and increases the number of variables.

48

Page 49: Descriptive Analytics: Data Reduction

49

Page 50: Descriptive Analytics: Data Reduction

XLMiner calculates the interval as (maximum value of the x3 variable − minimum value of the x3 variable) / number of bins specified by the user, or in this instance (252 − 96) / 4, which equals 39.

Bin 12: Values 96 - < 135
Bin 15: Values 135 - < 174
Bin 18: Values 174 - < 213
Bin 21: Values 213 - 252
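The same equal-width calculation can be sketched as follows (placing the top edge in the last bin is one reasonable convention, not necessarily XLMiner's):

```python
def bin_index(v, lo, hi, k):
    """Equal-width binning: interval = (max - min) / k, as in the slide
    ((252 - 96) / 4 = 39); the top edge falls in the last bin."""
    width = (hi - lo) / k
    return min(int((v - lo) // width), k - 1)

for v in (96, 134, 135, 213, 252):
    print(v, "-> bin", bin_index(v, 96, 252, 4))
```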

50

Page 51: Descriptive Analytics: Data Reduction

CLUSTER ANALYSIS

Page 52: Descriptive Analytics: Data Reduction

Cluster analysis, also called data segmentation, is a collection of techniques that seek to group or segment a collection of objects (observations or records) into subsets or clusters, such that those within each cluster are more closely related to one another than objects assigned to different clusters.

◦ The objects within clusters should exhibit a high amount of similarity, whereas those in different clusters will be dissimilar.

52

Page 53: Descriptive Analytics: Data Reduction

53

Page 54: Descriptive Analytics: Data Reduction

Hierarchical clustering: The data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters, each containing a single object.

k-Means clustering: Given a value of k, the k-means algorithm randomly partitions the observations into k clusters. After all observations have been assigned to a cluster, the resulting cluster centroids are calculated. Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid.
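The k-means loop described above can be sketched for one-dimensional data; the toy data set with two obvious groups is made up for the example:

```python
import random

def k_means(points, k, iters=20, seed=0):
    """Randomly partition 1-D points into k clusters, then alternate:
    compute each cluster's centroid, reassign points to the nearest centroid."""
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in points]
    centroids = [0.0] * k
    for _ in range(iters):
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = sum(members) / len(members)
        assign = [min(range(k), key=lambda c: abs(p - centroids[c]))
                  for p in points]
    return assign, centroids

assign, centroids = k_means([1, 2, 3, 10, 11, 12], k=2)
print(sorted(round(c, 1) for c in centroids))  # two centroids near 2 and 11
```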

54

Page 55: Descriptive Analytics: Data Reduction

• Agglomerative clustering methods proceed by a series of fusions of the n objects into groups; this is the method implemented in XLMiner

• Divisive clustering methods separate n objects successively into finer groupings

55

Page 56: Descriptive Analytics: Data Reduction

Euclidean distance is the straight-line distance between two points

The Euclidean distance measure between two points (x1, x2, . . . , xn) and (y1, y2, . . . , yn) is

d = √[(x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²]
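A direct implementation of the straight-line distance, with made-up points:

```python
import math

x, y = (1, 2, 3), (4, 6, 3)
# Square root of the sum of squared coordinate differences
d = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
print(d)                # 5.0
print(math.dist(x, y))  # the built-in equivalent (Python 3.8+)
```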

56

Page 57: Descriptive Analytics: Data Reduction

Single linkage clustering (nearest-neighbor)

The distance between groups is defined as the distance between the closest pair of objects, where only pairs consisting of one object from each group are considered. At each stage, the closest two clusters are merged.

Complete linkage clustering

The distance between groups is the distance between the most distant pair of objects, one from each group

Average linkage clustering

The distance between two clusters is defined as the average of distances between all pairs of objects, where each pair is made up of one object from each group.

Average group linkage clustering

Uses the mean values for each variable to compute distances between clusters

Ward’s hierarchical clustering

Uses a sum of squares criterion

Different methods generally yield different results, so it is best to experiment and compare the results.
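As a sketch of the agglomerative idea with single linkage, on a made-up one-dimensional data set:

```python
def single_linkage(points, k):
    """Agglomerative clustering: start with each point as its own cluster,
    then repeatedly merge the two clusters whose closest members are nearest,
    stopping once k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # fuse the closest pair of clusters
    return clusters

print(single_linkage([1, 2, 10, 11, 20], k=3))  # [[1, 2], [10, 11], [20]]
```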

57

Page 58: Descriptive Analytics: Data Reduction

58

Page 59: Descriptive Analytics: Data Reduction

Colleges and Universities Data

Cluster the institutions using the five numeric columns in the data set.

XLMiner > Data Analysis > Cluster > Hierarchical Clustering

Note: We are clustering the numerical variables, so School and Type are not included.

59

Page 60: Descriptive Analytics: Data Reduction

Check the box Normalize input data to ensure that the distance measure accords equal weight to each variable

Use the Euclidean distance as the similarity measure for numeric data.

Select the clustering method you wish to use.

60

Page 61: Descriptive Analytics: Data Reduction

Select the number of clusters (The agglomerative method of hierarchical clustering keeps forming clusters until only one cluster is left. This option lets you stop the process at a given number of clusters.) We selected four clusters.

61

Page 62: Descriptive Analytics: Data Reduction

Results

62

Page 63: Descriptive Analytics: Data Reduction

Dendrogram illustrates the fusions or divisions made at each successive stage of analysis

A horizontal line shows the cluster partitions

63

Page 64: Descriptive Analytics: Data Reduction

Predicted clusters

◦ shows the assignment of observations to the number of clusters we specified in the input dialog (in this case, four)

Cluster   # Colleges
1         23
2         22
3         3
4         1

64

Page 65: Descriptive Analytics: Data Reduction

DIMENSION REDUCTION

Page 66: Descriptive Analytics: Data Reduction

Dimension reduction - the process of removing variables from the analysis without losing any crucial information.

◦ One way is to examine pairwise correlations to detect variables or groups of variables that may supply similar information. Such variables can be aggregated or removed to allow more parsimonious model development.

Dimension reduction in XLMiner:

◦ Feature selection

◦ Principal components
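The pairwise-correlation check mentioned above might look like this in Python; the three variables are invented for illustration:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation, used to spot variables supplying similar information."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: x2 is (almost) a rescaling of x1, so the pair is redundant
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 4.0, 6.2, 7.9, 10.1]
x3 = [5, 1, 4, 2, 3]

print(round(pearson(x1, x2), 3))  # near 1 -> candidate for aggregation/removal
print(round(pearson(x1, x3), 3))  # weak -> keep both variables
```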

66

Page 67: Descriptive Analytics: Data Reduction

Feature Selection attempts to identify the best subset of variables (or features) out of the available variables (or features) to be used as input to a classification or prediction method.

67

Page 68: Descriptive Analytics: Data Reduction

68

Page 69: Descriptive Analytics: Data Reduction

The Principal Components procedure can be found on the XLMiner tab under Transform in the Data Analysis group.

Principal components analysis creates a collection of metavariables (components) that are weighted sums of the original variables. These components are uncorrelated with each other, and often only a few of them are needed to convey the same information as the large set of original variables.

69
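For just two variables, the first principal component can be computed in closed form as the leading eigenvector of the 2×2 covariance matrix; this sketch uses made-up data lying roughly along y = x:

```python
import math

def first_component_2d(xs, ys):
    """Direction of maximum variance for 2-D data: the leading eigenvector
    of the 2x2 sample covariance matrix, found in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Leading eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    vx, vy = sxy, lam - sxx  # corresponding (unnormalized) eigenvector
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Data lying (nearly) along y = x: the component points roughly along (1, 1)/sqrt(2)
print(first_component_2d([1, 2, 3, 4], [1.1, 2.0, 2.9, 4.0]))
```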

Page 70: Descriptive Analytics: Data Reduction

70

Page 71: Descriptive Analytics: Data Reduction

Data reduction - breaking down large sets of data into more-manageable groups or segments that provide better insight.

◦ Data sampling

◦ Data cleaning

◦ Data transformation

◦ Data segmentation

◦ Dimension reduction

71
