GUHA method in Data Mining Esko Turunen Tampere University of Technology Tampere, Finland


Data Mining in a Nutshell

Knowledge discovery in databases (KDD) was initially defined as the ‘non-trivial extraction of implicit, previously unknown, and potentially useful information from data’ [Frawley, Piatetsky-Shapiro, Matheus, 1991]. A revised version of this definition states that ‘KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data’ [Fayyad, Piatetsky-Shapiro, Smyth, 1996].

According to this definition, data mining is the step in the KDD process concerned with applying computational techniques (i.e., data mining algorithms implemented as computer programs) to actually find patterns in the data. In a sense, data mining is the central step in the KDD process. The other steps are concerned with preparing data for data mining and with evaluating the discovered patterns, i.e. the results of data mining.

I Data. The input to a data mining algorithm is most commonly a single flat table comprising a number of fields (columns) and records (rows). In general, each row represents an object and the columns represent properties of objects.

II Typical data mining tasks.
- Classification and regression: the task is to predict the value of one field (the class) from the other fields. If the class is continuous, the task is called regression. If the class is discrete, the task is called classification.
- Clustering is concerned with grouping objects into classes of similar objects. A cluster is a collection of objects that are similar to each other and dissimilar to objects in other clusters.
- Association analysis is the discovery of association rules. Association rules specify correlations between frequent item sets.
- Data characterisation summarises the general characteristics or features of the target class of data; this class is typically collected by a database query.

- Outlier detection is concerned with finding data objects that do not fit the general behaviour or model of the data; these are called outliers.
- Evolution analysis describes and models regularities or trends for objects whose behaviour changes over time.

III Outputs of data mining procedures can be
- Equations, e.g. TotalSpent = 189.5275 × Age + 7146.89 [€]
- Decision trees, e.g.

    Income
      ≤ 100.000 €: Yes
      > 100.000 €: Age
        ≤ 58: Yes
        > 58: No

- Predictive rules of the form IF Conjunction of conditions THEN Conclusion, e.g. IF income is 100.000 € and Gender = Male THEN not a Big Spender
- Association rules, e.g. {Gender = ‘Female’, Age = ‘>52’} → {Big Spender = ‘Yes’}

- Distance and similarity measures, e.g. $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$

- Probabilistic models, e.g. Bayesian networks
(For more details see Saso Dzeroski’s Relational Data Mining)
--------------------------------------------------------------------------------------------------------------------
Our aim is to study in detail a particular data mining method called GUHA and its computer implementation called LISp Miner. This approach is essentially association analysis; however, classification, clustering and outlier detection tasks can also be carried out by this method.
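
Three of the output forms listed under III can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the toy records are invented, the function names are hypothetical, and support and confidence are the standard measures for judging an association rule, which the text above does not name. The Euclidean form of the distance is likewise assumed. Nothing here reproduces GUHA or LISp Miner.

```python
from math import sqrt

# Hypothetical toy table: each dict is one record (row), each key one field (column).
# The values are invented for illustration only.
RECORDS = [
    {"Gender": "Female", "Age": ">52", "BigSpender": "Yes"},
    {"Gender": "Female", "Age": ">52", "BigSpender": "Yes"},
    {"Gender": "Female", "Age": ">52", "BigSpender": "No"},
    {"Gender": "Male", "Age": "<=52", "BigSpender": "Yes"},
    {"Gender": "Male", "Age": ">52", "BigSpender": "No"},
]


def total_spent(age):
    """Equation output from the text: TotalSpent = 189.5275 x Age + 7146.89 (in euros)."""
    return 189.5275 * age + 7146.89


def matches(record, conditions):
    """True if the record satisfies every field = value condition."""
    return all(record.get(field) == value for field, value in conditions.items())


def support_and_confidence(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    support    = share of all records satisfying both sides
    confidence = share of antecedent-satisfying records that also satisfy the consequent
    """
    n_antecedent = sum(matches(r, antecedent) for r in records)
    n_both = sum(matches(r, antecedent) and matches(r, consequent) for r in records)
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


def euclidean_distance(x, y):
    """Euclidean distance between two numeric vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


if __name__ == "__main__":
    rule_if = {"Gender": "Female", "Age": ">52"}
    rule_then = {"BigSpender": "Yes"}
    print(support_and_confidence(RECORDS, rule_if, rule_then))
    print(euclidean_distance((0, 0), (3, 4)))
    print(total_spent(40))
```

On this toy table the rule {Gender = ‘Female’, Age = ‘>52’} → {Big Spender = ‘Yes’} holds for two of the three records that satisfy its left-hand side, so how convincing a rule is depends on the data, not on its form alone.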
