Inmon - Data Mining - Exploring the Data

  • 7/27/2019 Inmon - Data Mining - Exploring the Data

    1997 William H. Inmon, all rights reserved

    TECH TOPIC 6

    DATA MINING: EXPLORING THE DATA, PART 2

    by W.H. Inmon

    [This Tech Topic is part 2 in the series of Tech Topics on data mining and data exploration. It is assumed that the reader has already read Part 1.]

    Data Correlation

    The basis for data mining and data exploration is the correlation of data. When data can be correlated mathematically and from a business basis, assumptions can be made and the basis for commercial exploitation is formed.

    The groundwork for correlation is the mathematical relationship between two or more variables. There is a wide variety of correlations of data. Figure 2.1 shows some simple types of correlations of data.

    The first correlation in Figure 2.1 is a perfect correlation of data. In this case, for every occurrence of A there is an occurrence of B, and vice versa. Such an occurrence seldom happens, but when it does, the perfect correlation forms a very sound basis for exploitation.

    A more normal case is a strong correlation of data, which is represented by the second set of data shown in Figure 2.1. In the second set of data, in most cases, where there is an A there is also a B. But in a few cases A will exist when there is no B, and B will sometimes exist where there is no A. This correlation is fairly common and also forms a sound basis for further exploitation.

    The third correlation shown in Figure 2.1 is a weak correlation. In a weak correlation, in some cases when there exists an A there will also exist a B, or in some cases where there exists a B there will also exist an A. But in many cases A and B will exist independently. As shown, the weak correlation between A and B does not form a particularly good case for commercial exploitation. And, if that is all there is, then nothing more can be done.

    Figure 2.1: different strengths of correlation (panels: a perfect correlation between two variables; a strong correlation between two variables; a weak correlation between two variables)
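The three strengths of existence-based correlation can be given a number. The sketch below is only an illustration: the text does not name a measure, so the phi coefficient is used here as an assumption, and the function name and the three small data sets are hypothetical. Each observation records whether A is present and whether B is present.

```python
from math import sqrt


def phi_coefficient(observations):
    """Phi coefficient for a list of (a_present, b_present) boolean pairs."""
    both = sum(1 for a, b in observations if a and b)
    only_a = sum(1 for a, b in observations if a and not b)
    only_b = sum(1 for a, b in observations if b and not a)
    neither = sum(1 for a, b in observations if not a and not b)
    denominator = sqrt(
        (both + only_a) * (only_b + neither) * (both + only_b) * (only_a + neither)
    )
    if denominator == 0:
        return 0.0  # A or B never varies, so no strength can be measured
    return (both * neither - only_a * only_b) / denominator


# Hypothetical observations, ten per case.
perfect = [(True, True)] * 5 + [(False, False)] * 5
strong = [(True, True)] * 4 + [(True, False)] + [(False, True)] + [(False, False)] * 4
weak = [(True, True)] * 3 + [(True, False)] * 2 + [(False, True)] * 2 + [(False, False)] * 3

for name, data in (("perfect", perfect), ("strong", strong), ("weak", weak)):
    print(name, round(phi_coefficient(data), 2))
```

A value near 1 corresponds to the perfect case, and values closer to 0 correspond to progressively weaker correlations.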

    But in a way, weak correlations are the most interesting of all for some very good reasons. The allure of weak correlations is that they may point to important trends that are as yet undiscovered, and because they are undiscovered, can lead to opportunities for exploitation that are as yet unknown. Therefore, there are two very important aspects of weak correlations that are worth exploring:

    - Is the correlation growing more significant over time?
    - Is the correlation very strong for a subset of either A or B?

    If the weak correlation is growing stronger over time, it is entirely possible that the weak correlation is a harbinger of large new trends that are just now developing. If such is the case, there may be massive opportunities for exploitation. Of course, how weak the correlation is and how fast the strength of the correlation is increasing are very relevant. If a correlation is increasing in strength at glacial speed, it will be very difficult to exploit the correlation. And if the correlation is so weak that it is useless, it will take a long time for the correlation to become strong enough to be exploited.
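How fast a weak correlation is strengthening can be estimated from its strength in successive periods. The following is a rough sketch under stated assumptions: the yearly strengths are hypothetical, the function names are invented for illustration, and the straight-line extrapolation is a simplification that the text does not prescribe.

```python
def average_change_per_period(strengths):
    """Average period-over-period change in correlation strength."""
    if len(strengths) < 2:
        raise ValueError("need at least two periods")
    return (strengths[-1] - strengths[0]) / (len(strengths) - 1)


def periods_until(strengths, target):
    """Estimated periods until the target strength is reached (linear assumption)."""
    rate = average_change_per_period(strengths)
    if rate <= 0:
        return None  # not growing, so the target is never reached
    return max(0.0, (target - strengths[-1]) / rate)


# Hypothetical correlation strengths measured once a year.
emerging = [0.10, 0.14, 0.19, 0.25]
glacial = [0.100, 0.101, 0.102, 0.103]

print(periods_until(emerging, 0.60))  # a handful of years
print(periods_until(glacial, 0.60))   # hundreds of years
```

Both series start out equally weak, but only the first is growing fast enough to be worth watching.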

    The second case where a weak correlation is of interest is where there is a weak correlation for the entire population, but a quite strong correlation for a subset of the population. In other words, if there are other characteristics of A that can be used to select a subpopulation of A, and if after having selected that subpopulation, the correlation between A and B becomes much stronger, then there will most likely be a significant opportunity for commercial exploitation by targeting the subpopulation of A that strongly correlates to B.
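The subpopulation effect can be shown with a small, entirely hypothetical data set. In this sketch the "segment" attribute, the record layout, and the counts are assumptions chosen for illustration. The strength measure is simply the share of records with A that also have B.

```python
def share_with_b(records):
    """Fraction of the records that have A which also have B."""
    with_a = [r for r in records if r["has_a"]]
    if not with_a:
        return 0.0
    return sum(1 for r in with_a if r["has_b"]) / len(with_a)


# Hypothetical population: every record has A; B is common only in one segment.
records = (
    [{"segment": "urban", "has_a": True, "has_b": True}] * 9
    + [{"segment": "urban", "has_a": True, "has_b": False}] * 1
    + [{"segment": "rural", "has_a": True, "has_b": True}] * 3
    + [{"segment": "rural", "has_a": True, "has_b": False}] * 27
)

overall = share_with_b(records)
by_segment = {
    segment: share_with_b([r for r in records if r["segment"] == segment])
    for segment in ("urban", "rural")
}

print("entire population:", overall)
print("by segment:", by_segment)
```

Across the whole population B accompanies A less than a third of the time, yet within the smaller segment it accompanies A nine times out of ten.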

    For these reasons, weak correlations are some of the most interesting of all the different ways that data can be correlated.

    Of course, where there is no correlation of data at all, there is little chance for exploitation through the techniques of data mining.

    There are of course other types of correlations other than correlations based on existence criteria. The correlations that have been discussed are based on whether two variables exist in the presence of each other. Another very common type of correlation is not based on existence at all, but is based on values. As an example, suppose A and B always exist in each other's presence. When A has a value greater than 50, B has a value greater than 100, and when A has a value greater than 100, B has a value greater than 200, and so forth. The correlation is measured not in terms of existence of variables, but in terms of the values of the variables compared to each other.
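A value-based correlation of this kind is commonly measured with the Pearson correlation coefficient. The text does not name a measure, so that choice is an assumption, and the A and B values below are hypothetical numbers picked so that B is roughly twice A.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical paired values: B tends to be about double A.
a_values = [50, 60, 80, 100, 120]
b_values = [105, 118, 170, 205, 238]

r = pearson(a_values, b_values)
print(round(r, 3))
```

A result close to 1 indicates that the values of B rise almost in lockstep with the values of A.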

    Multivariate Analysis

    Of course there are correlations between more than two variables. Figure 2.2 suggests a simple form of multivariate analysis.

    The relationships that can be divined doing multivariate analysis can be as interesting to the DSS analyst doing data mining and data exploration as the more common dual variable analysis. When the DSS analyst discovers a multivariate correlation, there is the potential for exploitation just as there is in the case of a correlation between two variables.

    Figure 2.2: multivariate analysis

    However, the discovery of a multivariate correlation is a difficult thing to accomplish in a data warehouse for two very important reasons:

    - the volume of data in the data warehouse makes analysis of multiple variables a very difficult thing to do, and
    - the number of variables is usually so many that it is not clear which combination of variables should be analyzed together.
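The second obstacle can be made concrete by counting how many groups of variables could be analyzed together. The sketch below uses Python's math.comb; the figure of 50 variables is a hypothetical example, not a number from the text.

```python
from math import comb


def combinations_to_examine(num_variables, max_group_size):
    """Number of distinct variable groups of size 2 up to max_group_size."""
    return sum(comb(num_variables, size) for size in range(2, max_group_size + 1))


# Hypothetical warehouse table with 50 candidate variables.
for max_size in (2, 3, 4):
    print(max_size, combinations_to_examine(50, max_size))
```

Pairs alone number in the thousands, and allowing groups of up to four variables pushes the count into the hundreds of thousands.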

    In addition, even when a multivariate correlation is discovered, the underlying business case may be spurious.

    There is then great potential in doing multivariate analysis in a data mining and data exploration effort. However, the DSS analyst should be aware of the obstacles that await.

    The Spectrum of Correlation

    Whether dual variables are correlated or multiple variables are correlated, the result will be a spectrum of strengths of correlation. Figure 2.3 depicts the spectrum of correlation that lies ahead for the DSS analyst.

    Corresponding to the spectrum of correlation is the spectrum of opportunity that awaits the DSS analyst who would exploit a correlation. As a general rule, the stronger the correlation, the greater the chance that the correlation is already well known. The greater the chance the correlation is well known, the greater the chance it is already being exploited. In other words, for well known correlations, even though the correlation is strong, the opportunity for exploitation is minimal because the competition is already making use of the relationship. However, if a strong relationship develops between two variables that are not well known, then there is a major opportunity for exploitation.

    For correlations that are not so strong there is a real opportunity for exploitation, if the correlation:

    - has not been widely discovered,
    - is growing in strength, or
    - is very strong for an identifiable subset of the population being studied.

    The DSS analyst should be aware of the spectrum of strength of correlation and should be aware of what opportunities are possible based on the strength of the correlation.

    Figure 2.3: there is a large spectrum of correlation between variables (from very strong correlation to no correlation whatsoever)

    The Business Relationship

    Even where there is a valid mathematical relationship that has been discovered, there is no guarantee that this relationship will lead to an opportunity for exploitation. Figure 2.4 shows that, after the mathematical correlation has been established, there must be an analysis of the underlying business relationship.

    Some of the possibilities in the analyzing of the underlying business relationship are:

    - the correlation is a false positive and has no underlying business relationship whatsoever,
    - the correlation is a previously unknown relationship and is ripe for exploitation,
    - the correlation has a business relationship at its basis, but the business relationship is well known and is already being exploited to the fullest, and
    - the correlation is mathematically valid and has no underlying business relationship, but the relationship is so strong that the correlation will still present opportunities for exploitation.

    Each of these circumstances will be discussed.

    In the case where there is a false positive and where there is no business relationship whatsoever, it is important that the DSS analyst know because the DSS analyst can save a large amount of resources by not attempting to exploit an opportunity that is not viable.

    In the case where the DSS analyst has discovered a previously unknown correlation and there is a business basis for the relationship, the DSS analyst has a rare and powerful opportunity for exploitation.

    In the case where there is both a mathematical relationship and a business basis for the relationship and where the relationship is well known, it is unlikely that there will be an opportunity for exploitation simply because the relationship has already been exploited.

    In the case where there is a strong mathematical correlation, but no apparent business basis, there are plenty of opportunities. One opportunity is to examine the business basis very carefully to discover whether there really is a basis, however subtle. Where there is a subtle business basis, there will be plenty of opportunity for exploitation because it is unlikely that this business relationship will have been found.

    Figure 2.4: just because there is a mathematical relationship between two or more variables does not necessarily mean that there is a business relationship. If there is no business relationship, exploitation will be very difficult to do

    Even in the case where there is no discernible business basis for a correlation, if the mathematical relationship is strong enough and is consistent enough, there will still be an opportunity for exploitation through the sheer strength of the relationship.

    Looking for Correlations

    The easiest place to start to look for correlations is in the most obvious places. Consider the scenario shown in Figure 2.5.

    Figure 2.5 shows that on rainy days, shops near bus stops will get more walk-in traffic. There may not be any purchases that take place, but there will be an abundance of foot traffic.

    There are many cases where the correlation is obvious: in the summer, a lot of beer is consumed. In the winter, snow chains are frequently sold. In the spring, lawn mowers are popular, and so forth.

    In the case of obvious correlations, the opportunity lies not so much in the discovery of the correlation but in the creation of novel ways to exploit the correlation.

    Figure 2.5: the starting place to look for correlations is the obvious place. Foot traffic at the five and dime (next to the bus stop) is heavier on rainy days

    As an example of a macro analysis of summary data and its use to discover correlations, consider the graph in Figure 2.6.

    Figure 2.6 shows that for a retailer, sales progress throughout the year and peak in December, at the time of the Christmas holidays. The summary table suggests that when this year's highs are greater than last year's highs and this year's lows are less than last year's lows, there might be some interesting correlations of data that could be observed.

    At an even more macro level, consider the profitability over a long period of time of several insurance companies, as seen in Figure 2.7.

    Figure 2.6: an obvious pattern: retail sales peak at Christmas time (monthly sales, Jan through Dec; vertical axis from 100 mil to $1 bil)

    Figure 2.7: comparing the profitability of insurance companies over time (insurance companies A, B, C, and D; 1987 through 1996; vertical axis from -50 mil to 100 mil)
    - how have they all been alike?
    - how have they been different?

    Figure 2.7 shows the profitability of four insurance companies over a decade. The obvious points of interest (and the most likely place to look for interesting correlations) are:

    - points where one company falls lower than all other companies or where one company soars higher than other companies, and
    - points where one company is operating on a different trend line than other companies.

    These macro observations lead to places where productive correlations can be found.

    Finding Interesting Correlations at the Micro Level

    Looking at macro indicators is a good way to find where in the grand scheme of things interesting correlations will reside, but analysis at the macro level can never get to the level of detail that will satisfy the DSS analyst. Analysis must proceed to the micro level in order to be productive.

    Figure 2.8 shows an accumulation of a large amount of sales data.

    One of the ways to find where there are interesting correlations is to ask: where are the exceptions to the norm? In the case of the data found in Figure 2.8, the DSS analyst has homed in on all sales greater than $10.00. There are five sales greater than $10.00. In fact, the sales greater than $10.00 are so significantly larger than all the other sales that there must be something interesting about them. The DSS analyst is tipped off to look at such things as:

    - What items were purchased?
    - Were the items purchased as a group?
    - Who were the items purchased by?
    - When were the items purchased?

    These questions may not lead to any interesting observations, but there is something unusual about the purchases greater than $10.00. Discovering what other data correlates to the sale amount is a productive place to start the analysis of the data.

    Figure 2.8: find all sales amounts greater than 10.00

    sale amt (96 sales, listed column by column):
    2.98, 65.98, 3.87, 48.76, 3.97, 2.76, 1.87, 3.75, 4.97, 5.01, 2.98, 4.65, 3.97, 2.86, 1.86, 2.97
    4.65, 3.65, 2.87, 3.97, 2.87, 78.32, 3.54, 4.33, 4.72, 3.34, 1.32, 3.37, 3.97, 1.45, 2.22, 8.42
    4.98, 3.22, 2.01, 3.97, 2.72, 4.16, 3.84, 1.76, 3.77, 1.22, 4.42, 3.20, 3.97, 1.53, 3.66, 1.10
    5.56, 3.09, 5.92, 4.46, 3.31, 1.29, 3.97, 1.79, 3.29, 1.56, 2.87, 2.97, 3.97, 1.72, 3.97, 2.73
    3.09, 3.37, 1.87, 4.92, 17.29, 2.33, 3.97, 1.79, 3.28, 2.38, 1.29, 3.41, 1.87, 1.75, 2.96, 3.19
    1.71, 3.09, 2.86, 2.19, 3.19, 2.37, 2.18, 1.98, 2.07, .98, 1.62, 3.82, 28.27, .93, 1.67, 2.97
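The search for exceptions to the norm amounts to a simple filter. As a minimal sketch, the list below holds only the first sixteen sale amounts from Figure 2.8; the function name is hypothetical, and the threshold of 10.00 is the one used in the text.

```python
# First sixteen sale amounts from Figure 2.8 (a sample, not the full set).
sale_amounts = [
    2.98, 65.98, 3.87, 48.76, 3.97, 2.76, 1.87, 3.75,
    4.97, 5.01, 2.98, 4.65, 3.97, 2.86, 1.86, 2.97,
]


def exceptions_above(amounts, threshold):
    """Return (position, amount) for every amount above the threshold."""
    return [(i, amount) for i, amount in enumerate(amounts) if amount > threshold]


large_sales = exceptions_above(sale_amounts, 10.00)
print(large_sales)
```

Keeping the position of each exceptional sale makes it possible to go back to the detailed record and ask what was purchased, by whom, and when.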

    In the same vein, finding the smallest and the largest sales may well lead to productive results. Figure 2.9 shows this simple analysis.

    And along the same line, looking for the five largest (or the five smallest) may be productive. Figure 2.10 shows this criterion.

    Figure 2.9: find the largest and the smallest (the sales data of Figure 2.8 is shown again)

    Figure 2.10: find the five largest (the sales data of Figure 2.8 is shown again)
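Both selections can be sketched with Python's built-in min and max and the standard heapq module. The sample below is again limited to the first sixteen sale amounts from Figure 2.8, so the results describe that sample only.

```python
import heapq

# First sixteen sale amounts from Figure 2.8 (a sample, not the full set).
sale_amounts = [
    2.98, 65.98, 3.87, 48.76, 3.97, 2.76, 1.87, 3.75,
    4.97, 5.01, 2.98, 4.65, 3.97, 2.86, 1.86, 2.97,
]

smallest = min(sale_amounts)
largest = max(sale_amounts)

# nlargest/nsmallest avoid sorting the whole list, which matters on large volumes.
five_largest = heapq.nlargest(5, sale_amounts)
five_smallest = heapq.nsmallest(5, sale_amounts)

print(smallest, largest)
print(five_largest)
print(five_smallest)
```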

    Yet another way to analyze the sales data, looking for a useful way to understand the data and to determine where there might be interesting correlations, is to create a profile of the data. Certainly averages and medians (i.e., mid points) can be calculated, and those numbers are meaningful. But another meaningful way to characterize the data is in terms of a profile. Figure 2.11 depicts a simple profile of the sales data.

    The profile is useful to determine if anything unusual is happening to the data. Within the population of the data itself there may be hidden trends and ratios. The profile created in Figure 2.11 shows that the vast majority of sales are between $1.00 and $4.00. Anything outside of that range is an anomaly. The sales form a slightly skewed bell curve around the $3.00 sales mark.

    The use of a profile is to characterize the masses of data in a perspective that is not immediately obvious from examining the details of the data directly.

    Figure 2.11: create a profile (of the sales data of Figure 2.8)

    sale amt             number of sales
    from  .00 to  .99     2
    from 1.00 to 1.99    21
    from 2.00 to 2.99    23
    from 3.00 to 3.99    31
    from 4.00 to 4.99    10
    from 5.00 to 5.99     3
    from 6.00 to 6.99     0
    greater than 6.99     6
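A profile like this is a count of sales per dollar band. The sketch below applies the same bands to the first sixteen sale amounts from Figure 2.8 only, so its counts are for that sample and are not meant to reproduce the figure. The function name is hypothetical and amounts are assumed to be non-negative.

```python
# First sixteen sale amounts from Figure 2.8 (a sample, not the full set).
sale_amounts = [
    2.98, 65.98, 3.87, 48.76, 3.97, 2.76, 1.87, 3.75,
    4.97, 5.01, 2.98, 4.65, 3.97, 2.86, 1.86, 2.97,
]

BAND_LABELS = [
    "from  .00 to  .99", "from 1.00 to 1.99", "from 2.00 to 2.99",
    "from 3.00 to 3.99", "from 4.00 to 4.99", "from 5.00 to 5.99",
    "from 6.00 to 6.99", "greater than 6.99",
]


def profile(amounts):
    """Count amounts per one-dollar band; the last band holds everything above 6.99."""
    counts = [0] * len(BAND_LABELS)
    for amount in amounts:
        counts[min(int(amount), len(BAND_LABELS) - 1)] += 1
    return counts


for label, count in zip(BAND_LABELS, profile(sale_amounts)):
    print(label, count)
```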

    Still another way to do a detailed analysis of data is to use a scatter chart. Figure 2.12 shows a simple scatter chart.

    In Figure 2.12, the shelf time of an item that has been sold is correlated with the price of the item.

    Two noticeable trends emerge from the scatter chart. Figure 2.13 shows those two trends.

    Figure 2.12: correlating the shelf time of a product against the cost of the item at final sale (scatter chart; axes: days, dollars). The 54 shelf time and cost pairs that are plotted are listed with Figure 2.14.

    Figure 2.13: identifying the major trends using a scatter chart (the scatter chart of Figure 2.12 with the two trends marked)

    The two trends show that some high-cost items sell very quickly. The length of time other items spend on the shelf (roughly) varies according to the price of the item. The contrast between the two trends that have been detected by the scatter chart is of real interest to the DSS analyst. The natural question to ask is: for slow-moving items, what characteristic of the item separates it from a fast-moving item? If the DSS analyst can make that determination, the business person will have a real chance to price and advertise items so that the amount of shelf time will be minimal, or so that the company's profits for long shelf-time items will be optimal.

    Still another way to analyze detailed data is by creating ratios. Figure 2.14 shows some ratios that have been created from the shelf time versus cost data.

    The first ratio shown in Figure 2.14 is that of the ratio of cost divided by shelf time for the entire population. The value that has been calculated is $4.97.

    The second set of ratios that has been calculated in Figure 2.14 is a profile set of ratios that calculate cost divided by shelf time by dollar value. The set of profiles shown in Figure 2.14 shows that the higher the price of an item, the longer the item can be expected to remain on the shelf. Such profiles of ratios can be very elucidating.

    Figure 2.14: creating ratios can be very elucidating. Creating ratios can be one metric that can bring to light many different examples that are of interest

    shelf time (days)   cost       shelf time (days)   cost       shelf time (days)   cost
     5                  10.99       4                  12.67       7                  16.82
     1                    .75       1                   1.19       3                   5.98
    20                  89.95      22                  65.98      24                   4.82
     3                   1.75       2                   2.21       4                   3.56
    21                  65.00      27                  90.00      18                 156.33
    10                  15.95      10                  16.00      13                  14.98
     3                  59.99      32                  65.98      27                 129.34
    35                   5.95       4                   6.95      14                   4.21
    13                   3.75       8                   3.79       6                   5.88
     5                   2.99       3                   2.76       9                   6.98
    18                   3.76       6                   3.87       3                   1.29
    16                  17.96      13                  19.44      18                  89.99
    17                   2.87       1                   1.77       2                   2.86
     8                   8.95       3                   2.61       3                  12.87
    10                  17.75      11                  49.95      15                  56.99
    12                  98.00      23                  97.34      19                  65.49
     6                   6.97       3                   8.97       4                   7.99
     4                   2.99       2                   4.55       1                   5.98

    cost/shelf time = 4.97 for the entire population

    cost/shelf time by dollar value
    from   .00 to  2.00      .54
    from  2.01 to  4.00     1.02
    from  4.01 to  6.00     1.08
    from  6.01 to 10.00     2.17
    from 10.01 to 15.00     4.96
    from 15.01 to 50.00    10.98
    greater than 50.00     10.42
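The two kinds of ratio can be sketched as follows. This is an illustration under stated assumptions: only the first ten shelf time and cost pairs from the figure are used, the ratio is taken as total cost divided by total shelf days (the text does not say how the aggregate was formed), and the function names are hypothetical. The output is therefore not expected to match the values printed in the figure.

```python
# First ten (shelf days, cost) pairs from Figure 2.14 (a sample, not the full set).
pairs = [
    (5, 10.99), (1, 0.75), (20, 89.95), (3, 1.75), (21, 65.00),
    (10, 15.95), (3, 59.99), (35, 5.95), (13, 3.75), (5, 2.99),
]

# Upper bounds of the dollar bands used in the figure; the last band is open-ended.
BAND_UPPER_BOUNDS = [2.00, 4.00, 6.00, 10.00, 15.00, 50.00]


def band_index(cost):
    """Index of the dollar band that a cost falls into."""
    for index, upper in enumerate(BAND_UPPER_BOUNDS):
        if cost <= upper:
            return index
    return len(BAND_UPPER_BOUNDS)


def cost_per_shelf_day(items):
    """Total cost divided by total shelf days (an assumed definition of the ratio)."""
    return sum(cost for _, cost in items) / sum(days for days, _ in items)


def ratios_by_band(items):
    """Ratio for each dollar band that contains at least one item."""
    grouped = {}
    for days, cost in items:
        grouped.setdefault(band_index(cost), []).append((days, cost))
    return {index: cost_per_shelf_day(group) for index, group in sorted(grouped.items())}


overall = cost_per_shelf_day(pairs)
by_band = ratios_by_band(pairs)
print(round(overall, 2))
print({index: round(ratio, 2) for index, ratio in by_band.items()})
```

Bands with no items are simply absent from the result rather than reported as zero.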

    There is a word of caution about ratios. Ratios can be very deceiving. A ratio says nothing about the number of calculations that went into the ratio. A ratio representing 1,000,000 calculations is a much more stable and believable ratio than one that represents 10 calculations. Figure 2.15 makes this point.

    In Figure 2.15 some ratios have as many as 15,000 calculations that were made to form the ratio. Other ratios have had fewer than 1,000 calculations as the basis for their calculation. The more calculations a ratio represents, the more believable the ratio.

    The micro techniques of data analysis that have been discussed are valid and appropriate for data in the data warehouse environment. However, given the volumes of data that are found in a data warehouse and the volumes of data on which data mining and data exploration are done, it is unreasonable to think that micro analysis will be done on the entire body of data. There simply is too much data to entertain such an activity. Instead, from a strategic standpoint, it is necessary to break the body of data into subpopulations for analysis, then use the techniques of micro analysis on those subpopulations.

    The selection of the subpopulation is one of the secrets to doing effective data mining. As a rule, it is wise to select intuitive subsets. Looking at demographics to select one subset of the population is a good place to start.

    Figure 2.15: in order to be most effective, ratios should be accompanied by the number of occurrences used in calculating the ratio

    cost/shelf time by dollar value    ratio    number of occurrences
    from   .00 to  2.00                  .54    12,980
    from  2.01 to  4.00                 1.02     9,871
    from  4.01 to  6.00                 1.08    14,971
    from  6.01 to 10.00                 2.17     7,517
    from 10.01 to 15.00                 4.96     2,981
    from 15.01 to 50.00                10.98     1,987
    greater than 50.00                 10.42       716
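One way to act on this caution is to carry the number of occurrences alongside each ratio and flag ratios that rest on too few of them. In the sketch below the ratios and counts are those shown in Figure 2.15, paired in the order in which they appear (an assumption about the figure's layout). The cutoff of 1,000 occurrences and the function name are illustrative choices, not rules from the text.

```python
# (dollar band, ratio, number of occurrences), paired in figure order (assumed).
ratio_table = [
    ("from .00 to 2.00", 0.54, 12980),
    ("from 2.01 to 4.00", 1.02, 9871),
    ("from 4.01 to 6.00", 1.08, 14971),
    ("from 6.01 to 10.00", 2.17, 7517),
    ("from 10.01 to 15.00", 4.96, 2981),
    ("from 15.01 to 50.00", 10.98, 1987),
    ("greater than 50.00", 10.42, 716),
]


def weakly_supported(table, minimum_occurrences):
    """Bands whose ratio rests on fewer occurrences than the chosen minimum."""
    return [band for band, _ratio, count in table if count < minimum_occurrences]


print(weakly_supported(ratio_table, 1000))
```

A flagged ratio is not necessarily wrong, but it deserves less weight than one built on many more occurrences.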

    Figure 2.16 shows the strategic approach to the selection of subpopulations.

    Even if it were possible to do a massive analysis across huge volumes of data, the presence anddiscovery of false positives makes such an approach unwise.

    As a rule the large population of data is divided into large subsets. The subsets then set theparameters for profiles of data across the population.

    Figure 2.17 shows some of the common ways that subdivision of data and the creation of profiles can be done.

    Figure 2.16: breaking the large mass into subsets (subset A, subset B, subset C, subset D) and identifying the distinguishing characteristics of each subset is an extremely useful thing to do. Each subset will have its own peculiarities, and the differences between subsets can present great marketing opportunities

    Figure 2.17: profiles are a good way to characterize a population. Three common ways of profiling a population are by time, by location, or by demographics

    accidents between:

    5:00 am - 6:00 am - 23
    6:00 am - 7:00 am - 98
    7:00 am - 8:00 am - 103
    8:00 am - 9:00 am - 291
    ...

    small businesses by county -

    Otero - 951
    El Paso - 8,975
    Terrell - 10,876
    Bexar - 128,081
    Austin - 87,917
    ...

    population by characteristic -

    female, 21-45 years, college degree - 29
    female, 46 and older, college degree - 34
    male, 21-45 years, college degree - 62
    male, 46 and older, college degree - 13


    Figure 2.17 shows that profiles can be created along the boundaries of time, geography, or demographics of the population. These three ways of creating subsets of a population are very normal and correspond to the ways that businesses interact with their customers.
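A profile along a single boundary amounts to counting occurrences of each distinct value along that boundary. The following is a small sketch under assumed data: the event records and field names are hypothetical and are not taken from Figure 2.17.

```python
from collections import Counter

# Hypothetical event records; the field names and values are assumptions.
EVENTS = [
    {"hour": 7, "county": "Bexar", "gender": "female"},
    {"hour": 7, "county": "El Paso", "gender": "male"},
    {"hour": 8, "county": "Bexar", "gender": "male"},
    {"hour": 8, "county": "Bexar", "gender": "female"},
]


def profile(rows, field):
    """Count occurrences of each distinct value of field."""
    return Counter(row[field] for row in rows)


by_time = profile(EVENTS, "hour")
by_location = profile(EVENTS, "county")
by_demographics = profile(EVENTS, "gender")

print("by time:", dict(by_time))
print("by location:", dict(by_location))
print("by demographics:", dict(by_demographics))
```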

    However, there is nothing that says that subpopulations which are studied by the data analyst have to be broken up in one dimension. It is entirely possible (even likely) that multiple dimensions can be used to subdivide data, as seen in Figure 2.18.

    In Figure 2.18 the three dimensions have been used to subdivide the population that is studied by the DSS analyst. There is one set of data for 1980, another for 1990, and so forth. Within the 1980 data, there is a subdivision of data by state, then by gender within state. Each of the subclassifications of data then has its micro analysis done on it.
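The nesting described here (year, then state, then gender within state) can be sketched with nested dictionaries. The rows and counts below are placeholders chosen for illustration and are not the figures from Figure 2.18; the function name is also an assumption.

```python
from collections import defaultdict

# Hypothetical census-style rows: (year, state, gender, count).
# The counts are placeholders, not the figures from Figure 2.18.
ROWS = [
    (1980, "Texas", "female", 100),
    (1980, "Texas", "male", 90),
    (1980, "Arkansas", "female", 20),
    (1990, "Texas", "female", 110),
    (1990, "Texas", "female", 5),
]


def subdivide(rows):
    """Nest totals by year, then state, then gender."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for year, state, gender, count in rows:
        tree[year][state][gender] += count
    return tree


cube = subdivide(ROWS)
for year in sorted(cube):
    for state in sorted(cube[year]):
        for gender, total in sorted(cube[year][state].items()):
            print(year, state, gender, total)
```

Each leaf of the nested structure is one subclassification, which is the unit on which micro analysis would then be done.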

    Exploitation

    Once the patterns have been discovered and the subpopulations have been discerned, how does the DSS analyst go about exploiting the correlations that have been discovered? The first thing the DSS analyst can do is to look in the right place. And exactly where are the right places? (Or at least, where have the most productive places to look for exploitation been?) Figure 2.19 illustrates the most productive places to look for opportunities for exploitation.

    Figure 2.18: of course, the dimensions of demographics can be combined. In this case, time, location, and demographics are represented in a single profile

                                        1980          1990
    Texas - females over 21       12,987,245    13,997,302
    Texas - females under 21       3,776,430     3,807,208
    Texas - males over 21         12,389,339    12,902,810
    Texas - males under 21         3,798,229     3,982,228
    Arkansas - females over 21     2,872,228     2,972,220
    Arkansas - females under 21      640,449       643,990
    Arkansas - males over 21       2,640,991     2,972,220
    Arkansas - males under 21        625,981       638,449
    ...


    Figure 2.19 shows that the most productive places to look for opportunities for exploitation have been as near to:

    - the mainline business as possible,
    - the cash flow of the business as possible,
    - the customer as possible, and
    - the sale as possible.

    The mainline business is, of course, different for every company. A manufacturer looks to the efficient production and packaging of goods. A retailer looks to the stocking of the SKU and its tracking during the sales process. A banker looks to the analysis of profitability. An insurance executive looks to the selling of new policies, the handling of claims, and the risk of going into new lines of business. The telecommunications manager looks to the acquisition of market share and the satisfaction of the customer. Each business is different in this regard.

    Figure 2.19: where to do data mining?

    - as near to the mainline business as possible
    - as near to the cash flow of the company as is possible
    - as near to the customer as possible
    - as near to the sale as possible

    There are indeed novel ways to use data mining and data exploration outside of the arenas discussed; however, the classical use of the techniques of mining has been in the arenas discussed.

    The essence of data mining is to be able to anticipate the marketplace, and to do so before the competition does. Through anticipation, the data miner is able to achieve marketplace opportunity. Figure 2.20 shows that through data mining the corporation knows where the marketplace is heading or is able to discover niches in the marketplace that have otherwise been overlooked.

    Exactly how can the marketplace be anticipated? There are a host of ways that anticipation can be done, including:

    - direct sales opportunities,
    - the creation of new products and new packaging of old products,
    - new sales to undiscovered market niches,
    - establishment of market share and market momentum,
    - niche advertising,
    - the influencing of buying patterns of the customer, and
    - the establishment of initial buying patterns (in the hopes that the initial buying pattern will be the start of a long term habit).

    Through the use of correlations, the DSS analyst creatively positions products and packages to take advantage of the information about the habits of consumers.

    Figure 2.20: anticipating the marketplace -

    - direct sales
    - new products
    - new sales
    - establish market share
    - niche advertising
    - influence buying patterns
    - establish initial buying pattern


    To be a bit more specific about how exploitation might occur, consider the suggestions made in Figure 2.21.

    Reusing Exploration

    There is no doubt that data mining and exploration require a large amount of resources, however successful or unsuccessful the mining and exploration might be. Figure 2.22 shows that the data mining and data exploration effort consumes considerable resources.

    To make the most of those resources, it is wise to salvage as much of the data mining and data exploration work as possible. At the end of the effort, rather than throwing everything away, a wise policy is to put as much in order as possible so that the next data mining and data exploration effort can make use of work that has already been done. If every new data mining and data exploration effort must build its entire infrastructure from scratch, massive amounts of wasted and duplicated effort are likely. Figure 2.23 shows that there is considerable infrastructure that might be saved from one data mining effort to the next.

    Figure 2.21: when event A happens, event B usually happens

    - make sure that facilitators of A are in sync with facilitators of B
    - tie A and B together as a product
    - tie A and B together in a package
    - tie A and B together in an advertisement
    - promote A and B together
    - restock A and B on the same schedule
    - position A and B closely together in the same location
    - when a change is detected in A anticipate the same change in B
    - continue to check the strength of the A and B relationship
    - continue to check the relationship of A and B with other variables
    - attempt to understand the business basis of A and B

    Figure 2.22: a lot of work goes into creating an effective exploitation of information


    The first step in doing data mining and data exploration from the data warehouse is the analysis done by the explorer. The types of work results that might be saved by the explorer include:

    - intermediate analysis results,
    - analytical programs, and
    - a metadata record of the iterative processing that has occurred.

    Figure 2.24 shows the results of analysis that might be saved in preparation for the next exploration.

    In addition to the intermediate results that might be saved, there are some other dimensions of heuristic processing that might be saved as well. Figure 2.25 suggests some of the DSS parameters that may be of use to future DSS analysts.

    Figure 2.23: when it comes time to do another data mining/data exploration effort, it is a waste to have to recreate the entire infrastructure from scratch all over again

    Figure 2.24: storing intermediate analysis, analytical programs, and metadata are all important to enabling the data miner/data explorer to be able to build on the work that has already been done


    In Figure 2.25 it is suggested that such things as:

    - assumptions,
    - sequence of iterative processing,
    - a description of each of the iterations,
    - a description of the structure of the data read and the results achieved, and
    - a description of the base data and its state

    are all useful descriptions for the analyst trying to reconstruct the work that has been done and to salvage partial results.
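As an illustration of how such descriptions might be captured for reuse, the sketch below stores them as a small record that can be serialized alongside the analysis results. The class names, field names, and sample values are hypothetical; the Tech Topic does not prescribe a storage format, and JSON is used here only as a convenient assumption.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class IterationRecord:
    """One step of the heuristic analysis (field names are hypothetical)."""
    sequence: int
    description: str
    input_structure: str
    result_summary: str


@dataclass
class ExplorationMetadata:
    """What an analyst might save so later efforts can build on this one."""
    assumptions: List[str]
    base_data_description: str
    base_data_state: str
    iterations: List[IterationRecord] = field(default_factory=list)

    def to_json(self):
        """Serialize the record so it can be stored alongside the results."""
        return json.dumps(asdict(self), indent=2)


record = ExplorationMetadata(
    assumptions=["sales data is complete for the period studied"],
    base_data_description="detailed sales history",
    base_data_state="snapshot taken before the first iteration",
)
record.iterations.append(
    IterationRecord(
        sequence=1,
        description="profiled sales by region",
        input_structure="one row per sale",
        result_summary="two regions stand out",
    )
)
print(record.to_json())
```

Recording the sequence number on each iteration preserves the order of the heuristic processing, which is what a later analyst needs in order to retrace the work.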

    The analytical results that have been achieved are also of interest to the explorer who wishes to build on previous efforts. Figure 2.26 shows some of the details that describe the results that might be saved.

    Figure 2.25: there are some aspects of processing that are peculiar to the world of DSS analytical processing that are not found elsewhere in the world of operational or other DSS processing

    of special importance are -

    - assumptions
    - sequence of processing
    - description of iterations
    - structure of data
    - state of base data

    Figure 2.26

    analytical results -

    - reports
    - spreadsheets
    - temporary analysis
    - distilled data
    - multi dimensional data

    if analytical results are to be reused, certain information about the analysis must be retained and stored in an accessible format -

    - description of content
    - sequence of iteration
    - analyst name
    - time of analysis
    - creating program
    - assumptions
    - interpretation of result
    - next iteration embarked on
    - related documentation
    - physical name of analysis
    - location of analysis
    - size of analysis


    And finally, the results achieved by the attempts at exploitation are measured in the marketplace. Figure 2.27 shows the traditional marketplace measurements.

    Tools

    The tools that are used for data mining and data exploration can be divided into several camps. Figure 2.28 shows the different classifications of tools useful in the world of data mining and data exploration.

    Figure 2.27: measuring the results of exploitation -

    - increased market share
    - increased sales
    - increased profitability
    - expanded product line
    - increased revenue
    - increased "pull through" sales

    Figure 2.28: some of the tools for data mining/data exploration

    1 - DATA WAREHOUSE

    - data warehouse integration, transformation
    - data warehouse metadata
    - relational dbms
    - parallel hardware platforms

    2 - ANALYSIS

    - DSS analyst metadata tools
    - query formulation
    - query storage
    - spreadsheet
    - multidimensional dbms
    - departmental platforms/work stations
    - graphical displays
    - statistical processing
    - data selection, transport

    3 - ANALYSIS PREPARATION

    - graphical display

    4 - EXPLOITATION

    - corporate measurement of results


    Summary

    The two Tech Topics on data mining and data exploration began with a discussion of an approach to data mining. The steps in the approach are:

    - infrastructure development
    - exploration
    - analysis
    - interpretation
    - exploitation

    The steps in the approach are heuristically derived. As such, there is no particular order in which the steps are executed.

    There are two types of DSS analyst: explorers and farmers. Explorers are people who do not know what they want, but will know it when they find it. Farmers are people who know what they want and regularly find it. Both types of people are served in the interest of data mining and data exploration.

    Data mining and data exploration occur at two levels: micro and macro. Both levels of exploration are necessary in order to be effective.

    The basis of data mining is correlating different types of data to each other. There are many ways to correlate different types of data: by units of data, by time, by geography, by demographics, and so forth. When correlations are spotted, trends can be developed. There are different kinds of trends:

    - long term trends,
    - short term trends, and
    - sub trends, etc.

    The data warehouse sets the stage for data mining and data exploration for a variety of reasons, such as:

    - The stability of the data in the data warehouse. When using a data warehouse, the DSS analyst knows that changes in results are caused by changes in the hypothesis, not by changes in the data.
    - The integrated data found in the warehouse.
    - The historical data found in the warehouse.
    - The metadata that describes the content of the warehouse.
    - The summary data found in the warehouse, and so forth.

    The largest problem the DSS analyst has in doing data mining and data exploration is that of the management of the volume of data found in the data warehouse. There is so much data that the DSS analyst drowns in it.

    Not only does the volume of data hide important relationships, but the sheer volume of data also ensures that some false positives will be found.

    Sampling is a good technique to cut down the volumes of data that need to be manipulated. Random samples can be drawn. Judgment samples can be drawn. If the judgment samples are drawn along the lines of demographic analysis, the sampling can be a very powerful technique.
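The two kinds of sample can be contrasted in a short sketch. The population, its fields, the function names, and the demographic filter are all assumptions made for illustration; a fixed seed is used only so that the random draw is repeatable.

```python
import random

# Hypothetical population of customer rows; the fields are assumptions.
POPULATION = [
    {"id": i, "gender": "female" if i % 2 == 0 else "male", "age": 20 + i % 50}
    for i in range(1000)
]


def random_sample(rows, size, seed=0):
    """Draw a simple random sample; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, size)


def judgment_sample(rows, predicate):
    """Keep only rows the analyst judges relevant, e.g. one demographic."""
    return [row for row in rows if predicate(row)]


sample = random_sample(POPULATION, 50)
women_21_to_45 = judgment_sample(
    POPULATION, lambda r: r["gender"] == "female" and 21 <= r["age"] <= 45
)
print(len(sample), len(women_21_to_45))
```

The random sample gives every row the same chance of selection, while the judgment sample deliberately narrows the data to a subset the analyst expects to be interesting.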

    Summary data can be very instructive in directing the DSS analyst where to look for interesting correlations.

    The intuition of the experienced DSS analyst should not be discounted either.


    Correlations between variables can be analyzed in many ways. One way is through the existence of the variables relative to each other, that is, whether one variable occurs when the other does. Another way is through the analysis of the values of the variables as they occur with each other.
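Both views can be illustrated with a brief sketch. The first function measures how often B is present when A is present; the second uses the Pearson correlation coefficient, which is one common choice for comparing values and is an assumption here, since the Tech Topic does not name a specific statistic. The data and function names are hypothetical.

```python
from math import sqrt


def cooccurrence_rate(baskets, a, b):
    """Share of occurrences of a in which b is also present."""
    with_a = [basket for basket in baskets if a in basket]
    if not with_a:
        return 0.0
    return sum(1 for basket in with_a if b in basket) / len(with_a)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: sets of items observed together, and paired values.
BASKETS = [{"A", "B"}, {"A", "B"}, {"A"}, {"B"}, {"A", "B"}]
A_VALUES = [1.0, 2.0, 3.0, 4.0]
B_VALUES = [2.0, 4.0, 6.0, 8.0]

print("B present when A is present:", cooccurrence_rate(BASKETS, "A", "B"))
print("correlation of values:", pearson(A_VALUES, B_VALUES))
```

Neither number says anything about the business basis of the relationship, which still has to be examined separately.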

    Once the mathematical basis is discovered, the business basis for the relationship needs to be examined. In some cases, there will be no business basis, and in other cases there will be a business basis. When the business basis of the relationship is discovered, the result can be interpreted and exploitation can occur.

    Analysis occurs at a macro level and a micro level. At the micro level there are many ways to analyze relationships of data. Some of the ways include:

    - looking at the mean,
    - creating a profile,
    - looking at the highest and lowest,
    - looking at the median,
    - creating scatter diagrams,
    - creating ratios, and so forth.

    Profiles are a good way to characterize a population. Profiles can be made in many ways: by time, by geography, by demographics, or by a combination of these characteristics.

    The best chance of exploitation classically has been as near to:

    - the mainline business as possible,
    - the cash flow of the company as possible,
    - the customer as possible, or
    - the sale as possible.

    The marketplace can be anticipated in many ways. Some of the classical positionings of the marketplace have been:

    - in direct sales,
    - in new products and new packages,
    - in finding new sales,
    - in establishing market share,
    - in niche advertising and promotions,
    - in influencing buying patterns,
    - in establishing new buying patterns, and so forth.