Descriptive Analytics: Data Reduction

Page 1: Descriptive Analytics: Data Reduction

DESCRIPTIVE ANALYTICS: DATA REDUCTION

Page 2: Descriptive Analytics: Data Reduction

Data reduction - breaking down large sets of data into more-manageable groups or segments that provide better insight.

◦ Data sampling

◦ Data cleaning

◦ Data transformation

◦ Data segmentation

◦ Dimension reduction

2

Page 3: Descriptive Analytics: Data Reduction

DATA SAMPLING

Page 4: Descriptive Analytics: Data Reduction

Data sampling - extract a sample of data that is relevant to the business problem under consideration.

◦ A population includes all of the entities of interest in a study.

◦ A sample is a subset of the population, often randomly chosen and preferably representative of the population as a whole.

Statistical inference focuses on drawing conclusions about populations from samples.

◦ Estimation of population parameters

◦ Hypothesis testing – involves drawing conclusions about the value of the parameters of one or more populations based on sample data.

4

Page 5: Descriptive Analytics: Data Reduction

Sampling plan - a description of the approach that is used to obtain samples from a population prior to any data collection activity.

A sampling plan states:

◦ its objectives

◦ target population

◦ population frame (the list from which the sample is selected)

◦ operational procedures for collecting data

◦ statistical tools for data analysis

5

Page 6: Descriptive Analytics: Data Reduction

Example: A company wants to understand how golfers might respond to a membership program that provides discounts at golf courses.

◦ Objective - estimate the proportion of golfers who would join the program

◦ Target population - golfers over 25 years old

◦ Population frame - golfers who purchased equipment at particular stores

◦ Operational procedures - e-mail link to survey or direct-mail questionnaire

◦ Statistical tools - PivotTables to summarize data by demographic groups and estimate likelihood of joining the program

6

Page 7: Descriptive Analytics: Data Reduction

Subjective sampling methods

◦ Judgment sampling – expert judgment is used to select the sample

◦ Convenience sampling – samples are selected based on the ease with which the data can be collected

Probabilistic sampling methods

◦ Simple random sampling – involves selecting items from a population so that every subset of a given size has an equal chance of being selected

◦ Systematic (periodic) sampling – a sampling plan that selects every nth item from the population.

◦ Stratified sampling – applies to populations that are divided into natural subsets (called strata) and allocates the appropriate proportion of samples to each stratum.

◦ ...

7
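The slides demonstrate these methods in Excel and XLMiner; as an illustrative sketch, the two simplest probabilistic methods can also be expressed in Python (the 100-item population here is made up for the example):

```python
import random

def simple_random_sample(population, n, seed=1):
    """Every subset of size n has an equal chance of being selected."""
    random.seed(seed)
    return random.sample(population, n)

def systematic_sample(population, step, start=0):
    """Systematic (periodic) sampling: select every step-th item."""
    return population[start::step]

data = list(range(1, 101))          # hypothetical population of 100 items
print(simple_random_sample(data, 5))
print(systematic_sample(data, 25))  # items 1, 26, 51, 76
```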

Page 8: Descriptive Analytics: Data Reduction

We can determine the appropriate sample size needed to estimate the population parameter within a specified level of precision (± E).

Sample size for the mean: n ≥ (z_{α/2})² σ² / E²

Sample size for the proportion: n ≥ (z_{α/2})² π(1 − π) / E²
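Assuming the standard formulas above (z is the normal critical value for the chosen confidence level, rounded up to the next integer), the computation can be sketched in Python; the σ, E, and p values are made-up examples:

```python
import math
from statistics import NormalDist

def sample_size_mean(sigma, E, conf=0.95):
    """n >= (z * sigma / E)^2, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil((z * sigma / E) ** 2)

def sample_size_proportion(p, E, conf=0.95):
    """n >= z^2 * p * (1 - p) / E^2, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / E ** 2)

print(sample_size_mean(sigma=10, E=2))        # 97 at 95% confidence
print(sample_size_proportion(p=0.5, E=0.03))  # 1068 at 95% confidence
```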

8

Page 9: Descriptive Analytics: Data Reduction

Using Analysis ToolPak Add-in

Data > Analysis > Data Analysis > Sampling

9

Page 10: Descriptive Analytics: Data Reduction

Sales Transactions database

Data > Data Analysis > Sampling

◦ Periodic – selects every nth number

◦ Random – selects a simple random sample

Sampling is done with replacement, so duplicates may occur.

10

Page 11: Descriptive Analytics: Data Reduction

XLMiner can sample from an Excel worksheet

XLMiner > Data > Get Data > Worksheet

11

Page 12: Descriptive Analytics: Data Reduction

Credit Risk Data

Click inside the database

XLMiner > Get Data > Worksheet

Select variables and move to right pane

Choose sampling options

12

Page 13: Descriptive Analytics: Data Reduction

Results

13

Page 14: Descriptive Analytics: Data Reduction

Using sample data may limit our ability to predict uncertain events that may occur because potential values outside the range of the sample data are not included.

A better approach is to identify the underlying probability distribution from which sample data come by “fitting” a theoretical distribution to the data and verifying the goodness of fit statistically.

◦ Examine a histogram for clues about the distribution’s shape

◦ Look at summary statistics such as the mean, median, standard deviation, coefficient of variation, and skewness

14

Page 15: Descriptive Analytics: Data Reduction

A random variable is a numerical description of the outcome of an experiment.

◦ A discrete random variable is one for which the number of possible outcomes can be counted.

◦ A continuous random variable has outcomes over one or more continuous intervals of real numbers.

A probability distribution is a characterization of the possible values that a random variable may assume along with the probability of assuming these values.

15

Page 16: Descriptive Analytics: Data Reduction

We may develop a probability distribution using any one of the three perspectives of probability:

Classical: probabilities can be deduced from theoretical arguments

Subjective: probabilities are based on judgment and experience (This is often done in creating decision models for phenomena for which we have no historical data)

Relative frequency (empirical): probabilities are based on the relative frequencies from a sample of empirical data

16

Page 17: Descriptive Analytics: Data Reduction

Roll 2 dice: 36 possible rolls – (1,1), (1,2), … (6,5), (6,6)

Probability = number of ways of rolling a number divided by 36; e.g., probability of a 3 is 2/36

Suppose two consumers try a new product. Four outcomes:

1. like, like
2. like, dislike
3. dislike, like
4. dislike, dislike

Probability at least one dislikes product = 3/4
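Both classical probabilities above can be checked by enumerating the equally likely outcomes, for example in Python:

```python
from itertools import product
from fractions import Fraction

# All 36 equally likely rolls of two dice
rolls = list(product(range(1, 7), repeat=2))
p_sum_3 = Fraction(sum(1 for a, b in rolls if a + b == 3), len(rolls))
print(p_sum_3)  # 1/18, i.e. 2/36

# Two consumers: 4 equally likely outcomes; "at least one dislikes" covers 3
outcomes = list(product(["like", "dislike"], repeat=2))
p_dislike = Fraction(sum(1 for o in outcomes if "dislike" in o), len(outcomes))
print(p_dislike)  # 3/4
```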

17

Page 18: Descriptive Analytics: Data Reduction

Distribution of an expert’s assessment of how the DJIA (Dow Jones Industrial Average) might change next year.

18

Page 19: Descriptive Analytics: Data Reduction

Airline Passengers

Sample data on passenger demand for 25 flights

◦ The histogram shows a relatively symmetric distribution. The mean, median, and mode are all similar, although there is moderate skewness. A normal distribution is not unreasonable.

19

Page 20: Descriptive Analytics: Data Reduction

Airport Service Times

Sample data on service times for 812 passengers at an airport’s ticketing counter

◦ It is not clear what the distribution might be. It does not appear to be exponential, but it might be lognormal or another distribution.

20

Page 21: Descriptive Analytics: Data Reduction

A better approach than simply visually examining a histogram and summary statistics is to analytically fit the data to the best type of probability distribution.

Several statistics measure goodness of fit:

◦ AIC/BIC (Akaike information criterion / Bayesian information criterion)

◦ Chi-square (need at least 50 data points)

◦ Kolmogorov-Smirnov (works well for small samples)

◦ Anderson-Darling (puts more weight on the differences between the tails of the distributions)

Analytic Solver Platform has the capability of fitting a probability distribution to data.

21

Page 22: Descriptive Analytics: Data Reduction

1. Highlight the data

Analytic Solver Platform> Tools > Fit

2. Fit Options dialog
Type: Continuous
Test: Kolmogorov-Smirnov
Click Fit button

22

Page 23: Descriptive Analytics: Data Reduction

The best-fitting distribution is called an Erlang distribution.

23

Page 24: Descriptive Analytics: Data Reduction

A random number is one that is uniformly distributed between 0 and 1.

Excel function: =RAND( )

A value randomly generated from a specified probability distribution is called a random variate.

◦ Example: Uniform distribution

24

Page 25: Descriptive Analytics: Data Reduction

Analysis Toolpak Random Number Generation Tool

◦ Can sample from uniform, normal, Bernoulli, binomial, Poisson, patterned, and discrete distributions.

◦ Can also specify a random number seed – a value from which a stream of random numbers is generated. By specifying the same seed, you can produce the same random numbers at a later time.

25

Page 26: Descriptive Analytics: Data Reduction

Generate 100 outcomes from a Poisson distribution with a mean of 12

◦ Number of Variables = 1

◦ Number of Random Numbers = 100

◦ Distribution = Poisson

◦ Dialog changes and prompts you to enter Lambda (mean of Poisson) = 12

26

Page 27: Descriptive Analytics: Data Reduction

Results

(Histogram created manually)

27

Page 28: Descriptive Analytics: Data Reduction

Normal: =NORM.INV(RAND( ), mean, stdev)

Standard normal: =NORM.S.INV(RAND( ))
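These Excel formulas apply the inverse-transform method; a rough Python equivalent using the standard library's NormalDist is sketched below (the mean 100 and stdev 15 are made-up values):

```python
import random
from statistics import NormalDist

def normal_variate(mean, stdev, rng=random):
    """Inverse-transform sampling: push a uniform random number through
    the normal inverse CDF -- the analogue of =NORM.INV(RAND(), mean, stdev)."""
    return NormalDist(mean, stdev).inv_cdf(rng.random())

random.seed(42)
samples = [normal_variate(100, 15) for _ in range(1000)]
print(round(sum(samples) / len(samples), 1))  # sample mean near 100
```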

28

Page 29: Descriptive Analytics: Data Reduction

In finance, one way of evaluating capital budgeting projects is to compute a profitability index: PI = PV / I, where

PV is the present value of future cash flows

I is the initial investment

What is the probability distribution of PI when PV is estimated to be normally distributed with a mean of $12 million and a standard deviation of $2.5 million, and the initial investment is also estimated to be normal with a mean of $3.0 million and standard deviation of $0.8 million?
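A quick Monte Carlo sketch of this question in Python (the seed and the 10,000 trials are arbitrary simulation choices, not from the slides):

```python
import random
from statistics import median

random.seed(1)
# PV ~ N(12, 2.5) and I ~ N(3, 0.8), in $ millions (parameters from the slide)
pis = [random.gauss(12, 2.5) / random.gauss(3, 0.8) for _ in range(10_000)]
print("median PI:", round(median(pis), 2))             # near 12/3 = 4
print("P(PI > 1):", sum(p > 1 for p in pis) / len(pis))
```

The median is used here rather than the mean because a few simulated I values can fall near zero, producing extreme PI ratios.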

29

Page 30: Descriptive Analytics: Data Reduction

Column F:=NORM.INV(RAND(), 12, 2.5)

Column G: =NORM.INV(RAND(), 3, 0.8)

30

Page 31: Descriptive Analytics: Data Reduction

Analytic Solver Platform provides Excel functions to generate random variates for many distributions

31

Page 32: Descriptive Analytics: Data Reduction

An energy company was considering offering a new product and needed to estimate the growth in PC ownership.

Using the best data and information available, they determined that the minimum growth rate was 5.0%, the most likely value was 7.7%, and the maximum value was 10.0% (a triangular distribution).

◦ A portion of 500 samples that were generated using the function PsiTriangular(5%, 7.7%, 10%):
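A comparable sketch using Python's standard library rather than PsiTriangular (note the different argument order):

```python
import random

random.seed(7)
# Growth-rate distribution from the slide: min 5%, most likely 7.7%, max 10%.
# Python's argument order is (low, high, mode), unlike PsiTriangular(min, likely, max).
rates = [random.triangular(0.05, 0.10, 0.077) for _ in range(500)]
print(round(min(rates), 3), round(max(rates), 3))  # both within [0.05, 0.10]
print(round(sum(rates) / len(rates), 3))           # near (0.05 + 0.077 + 0.10) / 3
```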

32

Page 33: Descriptive Analytics: Data Reduction

DATA CLEANING

Page 34: Descriptive Analytics: Data Reduction

Real data sets often have missing values or errors. Such data sets are called “dirty” and need to be “cleaned” prior to analyzing them.

◦ Handling missing data

◦ Handling outliers (observations that are radically different from the rest)

34

Page 35: Descriptive Analytics: Data Reduction

Approaches for handling missing data:

◦ Eliminate the records/variables that contain missing data

◦ Estimate reasonable values for missing observations, such as the mean or median value

◦ Use a data mining procedure to deal with them.

XLMiner has the capability to deal with missing data in the Transform menu in the Data Analysis group.
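Outside of XLMiner, the mean/median replacement strategy can be sketched in a few lines of Python; None marks a missing value in this made-up example:

```python
from statistics import mean, median

def impute(values, how="mean"):
    """Replace missing observations (None) with the mean or median of the
    observed values -- one of the simple remedies listed above."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if how == "mean" else median(observed)
    return [fill if v is None else v for v in values]

print(impute([1, None, 3]))               # [1, 2, 3]
print(impute([1, None, 3, 8], "median"))  # fills with the median, 3
```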

35

Page 36: Descriptive Analytics: Data Reduction

XLMiner's Missing Data Handling utility allows users to detect missing values in the dataset and handle them in a specified way. XLMiner considers an observation to be missing data if the cell is empty or contains an invalid formula. In addition, it is also possible to treat cells containing specific data as “missing”.

XLMiner offers several different methods for remedying missing or invalid values. Each variable can be assigned a different “treatment”. For example, the entire record could be deleted if there is a missing value for one variable, while the missing value could be replaced with a specific value for another variable.

36

Page 37: Descriptive Analytics: Data Reduction

37

Page 38: Descriptive Analytics: Data Reduction

Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools can uncover data quality issues and outliers.

Some typical rules of thumb:

◦ z-scores greater than +3 or less than -3

◦ Extreme outliers are more than 3*IQR to the left of Q1 or right of Q3

◦ Mild outliers are between 1.5*IQR and 3*IQR to the left of Q1 or right of Q3

Note:

* A standardized value, commonly called a z-score, provides a relative measure of the distance an observation is from the mean, which is independent of the units of measurement.

* The interquartile range (IQR), or midspread, is the difference between the first and third quartiles, Q3 – Q1.

38
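These rules of thumb can be sketched in Python; the data set is invented and contains one planted outlier:

```python
from statistics import mean, stdev, quantiles

data = list(range(20, 39)) + [150]  # made-up data with one planted outlier

# z-score rule: flag observations more than 3 standard deviations from the mean
m, s = mean(data), stdev(data)
z_outliers = [x for x in data if abs((x - m) / s) > 3]

# IQR rules: mild if beyond 1.5*IQR, extreme if beyond 3*IQR from the quartiles
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
mild = [x for x in data
        if q3 + 1.5 * iqr < x <= q3 + 3 * iqr
        or q1 - 3 * iqr <= x < q1 - 1.5 * iqr]
extreme = [x for x in data if x > q3 + 3 * iqr or x < q1 - 3 * iqr]
print(z_outliers, mild, extreme)  # [150] [] [150]
```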

Page 39: Descriptive Analytics: Data Reduction

Home Market Value data

None of the z-scores exceed 3. However, while individual variables might not exhibit outliers, combinations of them might.

◦ The last observation has a high market value ($120,700) but a relatively small house size (1,581 square feet) and may be an outlier.

39

Page 40: Descriptive Analytics: Data Reduction

Closer examination of outliers may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis.

A conservative approach is to create two data sets, one with and one without outliers, and then construct a model on both data sets.

◦ If a model’s implications depend on the inclusion or exclusion of outliers, then one should spend additional time to track down the cause of the outliers.

40

Page 41: Descriptive Analytics: Data Reduction

DATA TRANSFORMATION

Page 42: Descriptive Analytics: Data Reduction

Often data sets contain variables that, considered separately, are not particularly insightful but that, when combined as ratios, may represent important relationships.

◦ Example: the price/earnings (PE) ratio

A critical task is determining how to represent the measurements of the variables and which variables to consider.

◦ Example: The variable Language with the possible values of English, German, and Spanish would be replaced with three binary variables called English, German, and Spanish.
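The Language example might be coded as follows (the function and variable names are made up for illustration):

```python
def to_dummies(values, categories):
    """Replace a categorical variable with one binary (0/1) column per
    category, as in the Language example above."""
    return [{c: int(v == c) for c in categories} for v in values]

langs = ["English", "German", "Spanish", "German"]
dummies = to_dummies(langs, ["English", "German", "Spanish"])
print(dummies[1])  # {'English': 0, 'German': 1, 'Spanish': 0}
```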

42

Page 43: Descriptive Analytics: Data Reduction

v ∈ [min, max]

v′ = [(v − min) / (max − min)] × (max_new − min_new) + min_new

v′ ∈ [min_new, max_new]
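The min-max transformation above translates directly into code:

```python
def rescale(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Min-max transformation: map v from [lo, hi] onto [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

print(rescale(5, 0, 10))           # 0.5
print(rescale(96, 96, 252, 0, 4))  # 0.0
```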

43

Page 44: Descriptive Analytics: Data Reduction

XLMiner provides a Transform Categorical procedure under Transform in the Data Analysis group.

This procedure provides options to create dummy variables, create ordinal category scores, and reduce categories by combining them into similar groups.

44

Page 45: Descriptive Analytics: Data Reduction

45

Page 46: Descriptive Analytics: Data Reduction

46

Page 47: Descriptive Analytics: Data Reduction

47

Page 48: Descriptive Analytics: Data Reduction

In some cases, it may be desirable to transform a continuous variable into categories. XLMiner provides a Bin Continuous Data procedure under Transform in the Data Analysis group.

Caution: In general, transforming continuous variables into categories causes a loss of information (a continuous variable’s category is less informative than a specific numeric value) and increases the number of variables.

48

Page 49: Descriptive Analytics: Data Reduction

49

Page 50: Descriptive Analytics: Data Reduction

XLMiner calculates the interval as (maximum value of the x3 variable − minimum value of the x3 variable) / number of bins specified by the user, or in this instance (252 − 96) / 4, which equals 39.

Bin 12: Values 96 - < 135
Bin 15: Values 135 - < 174
Bin 18: Values 174 - < 213
Bin 21: Values 213 - 252
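The same equal-width calculation can be sketched as follows (placing the top edge in the last bin is one reasonable convention, not necessarily XLMiner's):

```python
def bin_index(v, lo, hi, k):
    """Equal-width binning: interval = (max - min) / k, as in the slide
    ((252 - 96) / 4 = 39); the top edge falls in the last bin."""
    width = (hi - lo) / k
    return min(int((v - lo) // width), k - 1)

for v in (96, 134, 135, 213, 252):
    print(v, "-> bin", bin_index(v, 96, 252, 4))
```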

50

Page 51: Descriptive Analytics: Data Reduction

CLUSTER ANALYSIS

Page 52: Descriptive Analytics: Data Reduction

Cluster analysis, also called data segmentation, is a collection of techniques that seek to group or segment a collection of objects (observations or records) into subsets or clusters, such that those within each cluster are more closely related to one another than objects assigned to different clusters.

◦ The objects within clusters should exhibit a high amount of similarity, whereas those in different clusters will be dissimilar.

52

Page 53: Descriptive Analytics: Data Reduction

53

Page 54: Descriptive Analytics: Data Reduction

Hierarchical clustering: The data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters, each containing a single object.

k-Means clustering: Given a value of k, the k-means algorithm randomly partitions the observations into k clusters. After all observations have been assigned to a cluster, the resulting cluster centroids are calculated. Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid.
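The k-means loop described above can be sketched for one-dimensional data; the toy data set with two obvious groups is made up for the example:

```python
import random

def k_means(points, k, iters=20, seed=0):
    """Randomly partition 1-D points into k clusters, then alternate:
    compute each cluster's centroid, reassign points to the nearest centroid."""
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in points]
    centroids = [0.0] * k
    for _ in range(iters):
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = sum(members) / len(members)
        assign = [min(range(k), key=lambda c: abs(p - centroids[c]))
                  for p in points]
    return assign, centroids

assign, centroids = k_means([1, 2, 3, 10, 11, 12], k=2)
print(sorted(round(c, 1) for c in centroids))  # two centroids near 2 and 11
```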

54

Page 55: Descriptive Analytics: Data Reduction

• Agglomerative clustering methods proceed by a series of fusions of the n objects into groups; this is the method implemented in XLMiner

• Divisive clustering methods separate n objects successively into finer groupings

55

Page 56: Descriptive Analytics: Data Reduction

Euclidean distance is the straight-line distance between two points

The Euclidean distance measure between two points (x1, x2, . . . , xn) and (y1, y2, . . . , yn) is

d = √[(x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²]
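A direct implementation of the straight-line distance, with made-up points:

```python
import math

x, y = (1, 2, 3), (4, 6, 3)
# Square root of the sum of squared coordinate differences
d = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
print(d)                # 5.0
print(math.dist(x, y))  # the built-in equivalent (Python 3.8+)
```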

56

Page 57: Descriptive Analytics: Data Reduction

Single linkage clustering (nearest-neighbor)

The distance between groups is defined as the distance between the closest pair of objects, where only pairs consisting of one object from each group are considered. At each stage, the closest two clusters are merged.

Complete linkage clustering

The distance between groups is the distance between the most distant pair of objects, one from each group

Average linkage clustering

The distance between two clusters is defined as the average of distances between all pairs of objects, where each pair is made up of one object from each group.

Average group linkage clustering

Uses the mean values for each variable to compute distances between clusters

Ward’s hierarchical clustering

Uses a sum of squares criterion

Different methods generally yield different results, so it is best to experiment and compare the results.
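As a sketch of the agglomerative idea with single linkage, on a made-up one-dimensional data set:

```python
def single_linkage(points, k):
    """Agglomerative clustering: start with each point as its own cluster,
    then repeatedly merge the two clusters whose closest members are nearest,
    stopping once k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # fuse the closest pair of clusters
    return clusters

print(single_linkage([1, 2, 10, 11, 20], k=3))  # [[1, 2], [10, 11], [20]]
```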

57

Page 58: Descriptive Analytics: Data Reduction

58

Page 59: Descriptive Analytics: Data Reduction

Colleges and Universities Data

Cluster the institutions using the five numeric columns in the data set.

XLMiner > Data Analysis > Cluster > Hierarchical Clustering

Note: We are clustering the numerical variables, so School and Type are not included.

59

Page 60: Descriptive Analytics: Data Reduction

Check the box Normalize input data to ensure that the distance measure accords equal weight to each variable

Use the Euclidean distance as the similarity measure for numeric data.

Select the clustering method you wish to use.

60

Page 61: Descriptive Analytics: Data Reduction

Select the number of clusters (The agglomerative method of hierarchical clustering keeps forming clusters until only one cluster is left. This option lets you stop the process at a given number of clusters.) We selected four clusters.

61

Page 62: Descriptive Analytics: Data Reduction

Results

62

Page 63: Descriptive Analytics: Data Reduction

Dendrogram illustrates the fusions or divisions made at each successive stage of analysis

A horizontal line shows the cluster partitions

63

Page 64: Descriptive Analytics: Data Reduction

Predicted clusters

◦ shows the assignment of observations to the number of clusters we specified in the input dialog (in this case, four)

Cluster   # Colleges
1         23
2         22
3         3
4         1

64

Page 65: Descriptive Analytics: Data Reduction

DIMENSION REDUCTION

Page 66: Descriptive Analytics: Data Reduction

Dimension reduction - the process of removing variables from the analysis without losing any crucial information.

◦ One way is to examine pairwise correlations to detect variables or groups of variables that may supply similar information. Such variables can be aggregated or removed to allow more parsimonious model development.

Dimension reduction in XLMiner:

◦ Feature selection

◦ Principal components
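The pairwise-correlation check mentioned above might look like this in Python; the three variables are invented for illustration:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation, used to spot variables supplying similar information."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: x2 is (almost) a rescaling of x1, so the pair is redundant
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 4.0, 6.2, 7.9, 10.1]
x3 = [5, 1, 4, 2, 3]

print(round(pearson(x1, x2), 3))  # near 1 -> candidate for aggregation/removal
print(round(pearson(x1, x3), 3))  # weak -> keep both variables
```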

66

Page 67: Descriptive Analytics: Data Reduction

Feature Selection attempts to identify the best subset of variables (or features) out of the available variables (or features) to be used as input to a classification or prediction method.

67

Page 68: Descriptive Analytics: Data Reduction

68

Page 69: Descriptive Analytics: Data Reduction

The Principal Components procedure can be found on the XLMiner tab under Transform in the Data Analysis group.

Principal components analysis creates a collection of metavariables (components) that are weighted sums of the original variables. These components are uncorrelated with each other, and often only a few of them are needed to convey the same information as the large set of original variables.

69
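For just two variables, the first principal component can be computed in closed form as the leading eigenvector of the 2×2 covariance matrix; this sketch uses made-up data lying roughly along y = x:

```python
import math

def first_component_2d(xs, ys):
    """Direction of maximum variance for 2-D data: the leading eigenvector
    of the 2x2 sample covariance matrix, found in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Leading eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    vx, vy = sxy, lam - sxx  # corresponding (unnormalized) eigenvector
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Data lying (nearly) along y = x: the component points roughly along (1, 1)/sqrt(2)
print(first_component_2d([1, 2, 3, 4], [1.1, 2.0, 2.9, 4.0]))
```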

Page 70: Descriptive Analytics: Data Reduction

70

Page 71: Descriptive Analytics: Data Reduction

Data reduction - breaking down large sets of data into more-manageable groups or segments that provide better insight.

◦ Data sampling

◦ Data cleaning

◦ Data transformation

◦ Data segmentation

◦ Dimension reduction

71
