Data Mining: Looking Beyond the Tip of the Iceberg
Sarabjot S. Anand and John G. Hughes, Faculty of Informatics, University of Ulster (Jordanstown Campus), Northern Ireland
1. What is Data Mining?
Over the past two decades there has been a huge increase in the amount of data being stored in databases, as well as in the number of database applications in business and the scientific domain. This explosion in the amount of electronically stored data was accelerated by the success of the relational model for storing data and the development and maturing of data retrieval and manipulation technologies. While technology for storing the data developed fast enough to keep up with demand, little attention was paid to developing software for analysing the data until recently, when companies realised that hidden within these masses of data was a resource that was being ignored. Database Management Systems used to manage these data sets at present only allow the user to access information explicitly present in the databases, i.e. the data. The data stored in a database is only a small part of the 'iceberg of information' available from it. Contained implicitly within this data is knowledge about a number of aspects of the business, waiting to be harnessed and used for more effective business decision support. This extraction of knowledge from large data sets is called Data Mining or Knowledge Discovery in Databases, and is defined as the non-trivial extraction of implicit, previously unknown and potentially useful information from data [FRAW91]. The obvious benefits of Data Mining have resulted in a lot of resources being directed towards its development.
Almost in parallel with the developments in the database field, machine learning research was maturing with the development of a number of sophisticated techniques based on different models of human learning. Learning by example, case-based reasoning, learning by observation and neural networks are some of the most popular learning techniques that were being used to create the ultimate thinking machine.
Figure 1: Data Mining
While the main concern of database technologists was to find efficient ways of storing, retrieving and manipulating data, the main concern of the machine learning community was to develop techniques for learning knowledge from data. It soon became clear that what was required for Data Mining was a marriage between technologies developed in the database and machine learning communities.
Data Mining can be considered to be an inter-disciplinary field involving concepts from Machine Learning, Database
Technology, Statistics, Mathematics, Clustering and Visualisation among others.
So how does Data Mining differ from Machine Learning? After all, the goal of both technologies is learning from data. Data Mining is about learning from existing real-world data rather than from data generated particularly for the learning task. In Data Mining the data sets are large, so efficiency and scalability of algorithms are important. As the data from which data mining algorithms learn is pre-existing real-world data, it typically contains many missing values and much noise, and it is not static, i.e. it is prone to updates. However, as the data is stored in databases, efficient methods for data retrieval are available that can be used to make the algorithms more efficient. Also, domain knowledge in the form of integrity constraints is available that can be used to constrain the learning algorithm's search space.
Data Mining is often accused of being a new buzzword for Database Management System (DBMS) reports. This is not true. Using a DBMS report a company could generate reports such as:

Last month's sales for each service type
Sales per service grouped by customer sex or age bracket
List of customers who lapsed their insurance policy
However, using Data Mining techniques the following questions may be answered:

What characteristics do my customers that lapse their policy have in common, and how do they differ from my customers who renew their policy?
Which of my motor insurance policy holders would be potential customers for my House Contents Insurance policy?
Clearly, Data Mining provides added value to DBMS reports and answers questions that DBMS reports cannot
answer.
2. Characteristics of a Potential Customer for Data Mining
Most of the challenges faced by data miners stem from the fact that data stored in real-world databases was not collected with discovery as the main objective. Storage, retrieval and manipulation of the data were the main objectives of the data being stored in databases. Thus most companies interested in data mining possess data with the following typical characteristics:
The stored data is large and noisy
Conventional methods of data analysis are not useful due to the complexity of the data structures and the size of the data
The data is distributed and heterogeneous due to most of the data being collected over time in legacy systems

The sheer size of the databases in real-world applications causes efficiency problems. The noise in the data and its heterogeneity cause problems in terms of the accuracy of the discovered knowledge and the complexity of the discovery algorithms required.
3. Aspects of Data Mining
In this section we discuss a number of issues that need to be addressed by any serious data mining package.
Uncertainty Handling: Nothing is certain in this world and therefore any system that tries to model a real-world scenario must allow a representation for uncertainty. A number of uncertainty models have been proposed in the Artificial Intelligence community. Though no consensus has been reached as to which model is best, it is recognised that attention must be paid to the selection of a model that is suitable for the problem at hand. Most Data Mining systems tend to employ the Bayesian probability model, though some support for Fuzzy Logic, Rough Sets and Evidence Theory has been shown as well.
Dealing with Missing Values: Missing values can occur in databases for two reasons: firstly, a value may not be available at the present time (incomplete information) and secondly, no value may be appropriate due to some other attribute's value in the tuple. Within the relational model missing values are represented as NULLs. Facilities must be provided to deal with NULL values within a Data Mining system, either by filling in these values before the discovery process is undertaken or by taking NULLs into account within the discovery process, perhaps by using a model of uncertainty like Evidence Theory that allows an explicit expression for ignorance. A number of methodologies have been suggested in the machine learning literature, e.g. NULL as an attribute value, using the most common attribute value, and decision tree techniques.
Dealing with Noisy Data: Noise in real-world databases is a fact of life. Discovery techniques used for Data Mining therefore need to be able to handle noisy data. Compared to symbolic learning techniques like decision tree induction, Neural Network techniques tend to generalise and learn classification knowledge better in the presence of noise. Though a number of techniques based on statistics have been used in machine learning, more robust techniques are required in Data Mining for dealing with noise if useful discovery from data is to be performed.
Efficiency of Algorithms: Machine Learning algorithms, though highly sophisticated and general, become very inefficient when used for learning from large data sets. In Data Mining the data sets are very large and therefore the need to create new, efficient, more specific algorithms is very important.
Constraining Knowledge Discovered to only Useful or Interesting Knowledge: From large amounts of data an even larger amount of knowledge can be discovered. Therefore what is required is techniques that prioritise the knowledge in terms of its usefulness or interestingness to the present needs of the user. At present the uncertainty and support of the knowledge, knowledge about the user domain, and some measure of interestingness are used. The measure of interestingness is accepted as being subjective, as what is interesting to one user may be of no interest to another. However, some aspects of interestingness can be automated and a number of measures have been suggested, e.g. the J-measure of Smyth and Goodman, and Piatetsky-Shapiro's measure based on statistical independence.
Incorporating Domain Knowledge: Very often some reliable knowledge about the discovery domain may be available to the user. An important question is how to use this knowledge to discover better knowledge in a more efficient way.
Size and Complexity of Data: Compared to machine learning problems, the data sets in Data Mining are much larger, noisier and more incomplete. Also, the data used for discovering knowledge in Data Mining was not collected or stored for the purpose of discovery. Most data has been collected over a period of time and lies in different formats in legacy systems. Thus, heterogeneity and distribution of data are of particular interest to Data Mining, and techniques are required for integrating heterogeneous and distributed data.
Data Selection: Due to the large amounts of data, efficiency of the Data Mining algorithms is important. One way of improving the efficiency of Data Mining techniques is by reducing the amount of data. A lot of work has been done in Machine Learning with respect to relevance; similar techniques need to be employed in Data Mining.
Understandability of Discovered Knowledge: Knowledge discovered using Data Mining techniques must be in a form that can be understood by the user, because at the end of the day a user will only be able to use the knowledge for decision making if he or she is able to understand it. This is the main failing of Neural Networks, as they are unintelligible black boxes. Decision Trees can get very large and opaque when built from a large training data set.
Consistency between Data and Discovered Knowledge: Data stored in databases may be updated from
time to time. Techniques are required for updating the knowledge discovered from the data so that it is
consistent with updates made to the data.
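Piatetsky-Shapiro's independence-based measure mentioned above is simple enough to sketch; the customer figures below are invented for illustration.

```python
def ps_measure(n_xy, n_x, n_y, n):
    """Piatetsky-Shapiro interestingness of a rule X -> Y over N tuples.

    Returns |X and Y| - |X||Y|/N: zero when X and Y are statistically
    independent, positive when they co-occur more often than chance.
    """
    return n_xy - (n_x * n_y) / n

# Hypothetical figures: 1000 customers, 400 aged under 30, 300 lapsed
# policies, 200 customers both under 30 and lapsed.
print(ps_measure(n_xy=200, n_x=400, n_y=300, n=1000))  # 80.0
```

A score of zero is exactly the case of statistical independence, which is what makes the measure attractive as an automatable filter.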
4. Classification of Data Mining Problems
Agrawal et al. [AGRA93] classify Data Mining problems into three categories:
Classification: Consider a bank that gives loans to its customers. The bank would obviously find it useful to be able to predict which new customers would be a good investment and which would not. Using data collected about previous customers, the bank would like to know the attributes that make a customer a good or a bad investment. What is required is a set of rules that partition the data into two exclusive groups: one of good investments and the other of bad investments. Such rules are called classification rules, as they classify the given data into a fixed number of groups. The data on old customers (for whom the group they belong to is known) is called the training set, from which the classification rules are discovered. The classification rules can then be used to determine which group a new customer belongs to.
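A discovered rule set for the bank scenario might look like the sketch below; the attribute names and thresholds are invented for illustration, not learned from any real training set.

```python
def classify(customer):
    """Toy classification rules partitioning customers into two exclusive
    groups; in practice such rules would be induced from the training set."""
    if customer["income"] >= 20000 and customer["defaults"] == 0:
        return "good investment"
    return "bad investment"

print(classify({"income": 35000, "defaults": 0}))  # good investment
print(classify({"income": 12000, "defaults": 2}))  # bad investment
```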
Two approaches have been employed within machine learning to learn classifications: Neural Network based approaches and induction based approaches. Both have a number of advantages and disadvantages. Neural Networks may take longer to train than a rule induction approach, but they are known to be better at learning to classify in situations where the data is noisy. However, as it is difficult to explain why a Neural Network made a particular classification, they are often dismissed as unsuitable for real Data Mining. Rule induction based approaches to classification are normally Decision Tree based. Decision Trees can get very large and cumbersome when the training set is large, which is the case in Data Mining, and though they are not black boxes like Neural Networks, they too become difficult to understand.
Both Neural Networks [ANAN95] and Tree Induction [AGRA92] techniques have been employed for Data Mining, along with statistical techniques [CHAN91].
Association: This involves rules that associate one attribute of a relation to another. For example, if we have a table containing information about people living in Belfast, an association rule could be of the type

(Age < 25) ∧ (Income > 10000) → (Car_model = Sports)

This rule associates the age and income of a person with the type of car he drives.
Set-oriented approaches [AGRA93, AGRA94, AGRA95] developed by Agrawal et al. are the most efficient techniques for the discovery of such rules. Other approaches include attribute-oriented induction techniques [HAN94], information theory based induction [SMYT91] and minimal-length encoding based induction [PEDN91].
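The strength of such a rule is usually measured by its support and confidence. The brute-force scan below only illustrates the idea; the efficient set-oriented algorithms cited above avoid scanning the data once per candidate rule, and the table of Belfast residents is invented.

```python
def support_confidence(rows, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent,
    where both are predicates over a row (a dict)."""
    n = len(rows)
    n_a = sum(1 for r in rows if antecedent(r))
    n_ac = sum(1 for r in rows if antecedent(r) and consequent(r))
    return n_ac / n, (n_ac / n_a if n_a else 0.0)

people = [
    {"age": 23, "income": 12000, "car": "Sports"},
    {"age": 24, "income": 15000, "car": "Sports"},
    {"age": 40, "income": 30000, "car": "Saloon"},
    {"age": 22, "income": 11000, "car": "Hatchback"},
]
support, confidence = support_confidence(
    people,
    lambda r: r["age"] < 25 and r["income"] > 10000,
    lambda r: r["car"] == "Sports",
)
print(support, confidence)  # support 0.5, confidence 2/3
```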
Sequences: This involves rules that are based on temporal data. Suppose we have a database of natural disasters. If from such a database we conclude that whenever there was an earthquake in Los Angeles, the next day Mt. Kilimanjaro erupted, such a rule would be a sequence rule. Such rules are useful for making predictions, which could be useful for making market gains or for taking preventive action against natural disasters. The factor that differentiates sequence rules from other rules is the temporal factor.
Techniques have been developed for discovering sequence relationships using Discrete Fourier Transforms to
map time sequences to the frequency domain [AGRA93, AGRA95]. This technique is based on two
observations:
for most sequences of practical interest only the first few frequencies are strong
Fourier transforms preserve the Euclidean distance between the time and frequency domains
Another technique uses Dynamic Time Warping, a technique used in the speech recognition field, to find patterns in
temporal data [BERN94].
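The second observation, essentially Parseval's theorem, can be checked with a naive DFT: the unnormalised transform scales Euclidean distance by a constant factor of sqrt(n), so nearest-neighbour comparisons are preserved. The cited techniques would use an FFT and keep only the first few coefficients; the sequences here are invented.

```python
import cmath
import math

def dft(xs):
    """Naive discrete Fourier transform, O(n^2); fine for a demonstration."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(xs))
            for k in range(n)]

def distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 3.0, 2.0, 5.0]
b = [2.0, 1.0, 2.0, 4.0]
# The distance in the frequency domain is sqrt(n) times the distance
# in the time domain, so these two prints show the same value.
print(distance(a, b))
print(distance(dft(a), dft(b)) / math.sqrt(len(a)))
```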
5. A Data Mining Model
5.1 Data Pre-processing: As mentioned in section 2, data stored in the real world is full of anomalies that need to be dealt with before sensible discovery can be made. This data pre-processing/cleansing may be done using visualisation or statistical tools. Alternatively, a Data Warehouse (see section 8.1) may be built prior to the Data Mining tools being applied.

Data pre-processing involves removing outliers in the data, predicting and filling in missing values, noise reduction, data dimensionality reduction and heterogeneity resolution. Some of the tools commonly used for data pre-processing are interactive graphics, thresholding and Principal Component Analysis.
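As a small illustration of one such step, the sketch below fills missing values in a single attribute with its most common observed value (one of the strategies listed in section 3); the field name is invented, and real cleansing would combine several such techniques.

```python
from collections import Counter

def fill_missing(rows, attr):
    """Replace None (NULL) in one attribute with its most common value."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [dict(r, **{attr: most_common if r[attr] is None else r[attr]})
            for r in rows]

rows = [{"sex": "F"}, {"sex": None}, {"sex": "F"}, {"sex": "M"}]
print(fill_missing(rows, "sex"))  # the NULL is filled with "F"
```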
5.2 Data Mining Tools: The Data Mining tools consist of the algorithms that automatically discover patterns from
the pre-processed data. The tool chosen depends on the mining task at hand.
5.3 User Bias: The User is central to any discovery/ mining process. User Biases in the Data Mining model are a
way for the User to direct the Data Mining tools to areas in the database that are of interest to the user. User Bias
may consist of:
Attributes of Interest in the databases
Goal of discovery
A minimum degree of support and confidence in any knowledge discovered
Domain Knowledge
Prior Knowledge/ Beliefs about the domain
6. Data Mining Technologies
Various approaches have been used for Data Mining based on inductive learning [IMAM93], Bayesian statistics [WUQ91], information theory [SMYT91], fuzzy sets [YAGE91], rough sets [ZIAR91], relativity strength [BELL93], Evidence Theory [ANAN95] etc.
6.1 Machine Learning
Making a machine mimic the intelligent behaviour of humans has been a long-standing goal of Artificial Intelligence researchers, who have taken their inspiration from a variety of sources such as psychology, cognitive science and neurocomputing.
Machine Learning paradigms can be classified into two classes: Symbolic and Non-symbolic paradigms. Neural
Networks are the most common non-symbolic paradigm while rule-induction is a symbolic paradigm.
6.1.1 Neural Networks: Neural Networks are a non-symbolic paradigm of Machine Learning that finds its inspiration in neuroscience. The realisation that most symbolic learning paradigms are unsatisfactory in a number of domains regarded by humans as trivial, e.g. pattern recognition, led to research into trying to model the human brain.
The human brain consists of a network of approximately 10^11 neurones. Each biological neurone consists of a number of nerve fibres called dendrites connected to the cell body where the cell nucleus is located. The axon is a long, single fibre that originates from the cell body and branches near its end into a number of strands. At the ends of these strands are the transmitting ends of the synapses that connect to other biological neurones through the receiving ends of the synapses found on the dendrites as well as the cell bodies of biological neurones. A single axon typically makes thousands of synapses with other neurones. The transmission process is a complex chemical process which effectively increases or decreases the electrical potential within the cell body of the receiving neurone. When this electrical potential reaches a threshold value (the action potential) the neurone enters its excitatory state and is said to fire. It is the connectivity of the neurones that gives these simple 'devices' their real power. The figure above shows a typical biological neurone.
An artificial neurone [HERT91] (or processing element, PE) is a highly simplified model of the biological neurone (see figure). As in biological neurones, an artificial neurone has a number of inputs, a cell body (consisting of the summing node and the semi-linear function node in the figure) and an output which can be connected to a number of other artificial neurones.
Neural Networks are densely interconnected networks of PEs together with a rule to adjust the strength of the connections between the units in response to externally supplied data. Using neural networks as a basis for a computational model has its origins in pioneering work conducted by McCulloch and Pitts in 1943 [McCU43]. They suggested a simple model of a neurone that computed the weighted sum of its inputs and output a 1 or a 0 according to whether the sum was over a threshold value or not. A zero output would correspond to the inhibitory state of the neurone, while a 1 output would correspond to the excitatory state. But the model was far from a true model of a biological neurone, as for a start the biological neurone's output is a continuous function rather than a step function. The step function has since been replaced by other more general, continuous functions called activation functions. The most popular of these is the sigmoid function, defined as:
f(x) = 1 / (1 + e^-x)
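A single artificial neurone with this activation can be sketched as below; the weights and threshold are arbitrary illustrative values rather than the result of any training.

```python
import math

def sigmoid(x):
    """The sigmoid activation f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def neurone(inputs, weights, threshold):
    """Weighted sum of the inputs, shifted by the threshold, passed
    through the activation function."""
    total = sum(i * w for i, w in zip(inputs, weights)) - threshold
    return sigmoid(total)

print(sigmoid(0.0))  # 0.5, the midpoint of the sigmoid
print(neurone([1.0, 0.0, 1.0], [0.4, 0.9, 0.6], threshold=0.5))
```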
The overall behaviour of a network is determined by its connectivity rather than by the detailed operation of any
element. Different topologies for neural networks are suitable for different tasks e.g. Hopfield Networks for
optimization problems, Multi-layered Perceptron for classification problems and Kohonen Networks for data coding.
There are three main ingredients to a neural network:
the neurones and the links between them
the algorithm for the training phase
a method for interpreting the response from the network in the testing phase
The learning algorithms used are normally iterative, e.g. the back-propagation algorithm, attempting to reduce the error in the output of the network. Once the error is reduced (not necessarily minimised) the network can be used to classify other, unseen objects.
Though neural networks seem an attractive concept, they have a number of disadvantages. Firstly, the learning process is very slow compared to other learning methods. The learned knowledge is in the form of a network and it is difficult for a user to interpret it (the same is a disadvantage of using large decision trees). Interactive user intervention in the learning process, which is normally required in Data Mining applications, is difficult to incorporate. However, neural networks are known to perform better than symbolic learning techniques on the noisy data found in most real-world data sets.
6.1.2 Rule Induction: Automating the process of learning has enthralled AI researchers for some years now. The basic idea is to build a model of the environment using sets of data describing the environment. The simplest model, clearly, is to store all the states of the environment along with all the transitions between them over time. For example, a chess game may be modelled by storing each state of the chess board along with the transitions from one state to the other. But the usefulness of such a model is limited, as the number of states and transitions between them is astronomically large. Thus, it is unlikely that a state that occurs in the future would match, exactly, a state from the past. A better model, then, is to store abstractions/generalisations of the states and the associated transitions. The process of generalisation is called induction.
Each generalization of the states is called a class and has a class description associated with it. The class description
defines the properties that a state must have to be a member of the associated class.
The process of building a model of the environment using examples of states of the environment is called Inductive
Learning. There are two basic types of inductive learning:
Supervised Learning
Unsupervised Learning
In Supervised Learning the system is provided with examples of states together with a class label for each example, defining the class that the example belongs to. Supervised Learning techniques are then used on the examples to find a description for each of the classes. The set of examples is called the training data set. Supervised learning may be classified into Single Class Learning and Multiple Class Learning.
In Single Class Learning the supervisor defines a single class by providing examples of states belonging to that class
(positive examples). The supervisor may also provide examples of states that do not belong to that class (negative
examples). The inductive learning algorithm then constructs a class description for the class that singles out instances
of that class from other examples.
In Multiple Class Learning the examples provided by the supervisor belong to a number of classes. The inductive learning algorithm constructs a class description for each of the classes that distinguishes states belonging to one class from those belonging to another.
In Unsupervised Learning the classes are not provided by a supervisor. The inductive learning algorithm has to
identify the classes by finding similarities between different states provided as examples. This process is called
learning by observation and discovery.
6.2 Statistics: Statistical techniques may be employed for data mining at a number of stages of the mining process. In fact, statistical techniques have long been employed by analysts to detect unusual patterns and to explain patterns using statistical models. However, using statistical techniques and interpreting their results is difficult and requires a considerable amount of knowledge of statistics. Data Mining seeks to provide non-statisticians with useful information that is not difficult to interpret. We now discuss how statistical techniques can be used within Data Mining.
(1) Data Cleansing: The presence of data which are erroneous or irrelevant (outliers) may impede the mining process. Whilst such data therefore need to be distinguished, this task is particularly sensitive, as some outliers may be of considerable interest in providing the knowledge that mining seeks to find: 'good' outliers need to be retained, whilst 'bad' outliers should be removed. Bad outliers may arise from sources such as human or mechanical errors in experimental measurement, from the failure to convert measurements to a consistent scale, or from slippage in time-series measurements. Good outliers are those that may be characteristic of the real-world scenario being modelled. While these are often of particular interest to users, knowledge about them may be difficult to come by and is frequently more critical than knowledge about more commonly occurring situations. The presence of outliers may be detected by methods involving thresholding the difference between particular attribute values and the average, using either parametric or non-parametric methods.
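A minimal parametric version of this thresholding flags points lying more than k standard deviations from the mean; whether a flagged point is a 'good' or a 'bad' outlier remains a judgement for the user, and the readings below are invented.

```python
import statistics

def flag_outliers(values, k=2.0):
    """Return the values lying more than k standard deviations
    from the mean of the sample."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0]
print(flag_outliers(readings))  # [55.0]
```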
(2) Exploratory Data Analysis: Exploratory Data Analysis (EDA) concentrates on simple arithmetic and easy-to-
draw pictures to provide Descriptive Statistical Measures and Presentation, such as frequency counts and table
construction (including frequencies, row, column and total percentages), building histograms, computing measures of
location (mean, median) and spread (standard deviation, quartiles and semi inter-quartile range, range).
(3) Data Selection: In order to improve the efficiency and the time performance of data analysis, it is necessary to provide sampling facilities to reduce the scale of computation. Sampling is an efficient way of finding association rules, and resampling offers opportunities for cross-validation. Hierarchical data structures may be explored by segmentation and stratification.
(4) Attribute Re-definition: We may define new variables which are more meaningful than the original ones, e.g. Body Mass Index (BMI) = Weight / Height squared. Alternatively we may want to change the granularity of the data, e.g. age in years may be grouped into age bands 0-20 years, 20-40 years, 40-60 years, 60+ years.
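Both kinds of re-definition are mechanical, as the sketch below shows for BMI and for the age bands just mentioned.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight divided by height squared."""
    return weight_kg / height_m ** 2

def age_group(age):
    """Coarsen age in years into the bands 0-20, 20-40, 40-60, 60+."""
    if age < 20:
        return "0-20"
    if age < 40:
        return "20-40"
    if age < 60:
        return "40-60"
    return "60+"

print(round(bmi(70, 1.75), 1))  # 22.9
print(age_group(35))            # 20-40
```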
Principal Component Analysis (PCA) is of particular interest to Data Mining as most Data Mining algorithms have
linear time complexity with respect to the number of tuples in the database but are exponential with respect to the
number of attributes of the data. Attribute Reduction using PCA thus provides a facility to account for a large
proportion of the variability of the original attributes by considering only relatively few new attributes (called Principal
Components) which are specially constructed as weighted linear combinations of the original attributes. The first
Principal Component (PC) is that weighted linear combination of attributes with the maximum variation; the second
PC is that weighted linear combination which is orthogonal to the first PC whilst maximising the variation, etc. The new
attributes formed by PCA may possibly themselves be assigned individual meaning if domain knowledge is invoked,
or they may be used as inputs to other Knowledge Discovery tools. The facility for PCA requires the partial computation of the eigensystem of the correlation matrix, as the PC weights are the eigenvector components, with the eigenvalues giving the proportions of the variance explained by each PC.
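For just two attributes the eigensystem of the correlation matrix has a closed form, which makes the idea easy to sketch; higher-dimensional data would need a numerical eigensolver, and the sample heights and weights are invented.

```python
import math
import statistics

def correlation(xs, ys):
    """Sample (Pearson) correlation coefficient of two attributes."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    n = len(xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

def pca_two_attributes(xs, ys):
    """PCA of two standardised attributes. Their correlation matrix is
    [[1, r], [r, 1]], whose eigenvalues are 1+r and 1-r, with eigenvectors
    (1,1)/sqrt(2) and (1,-1)/sqrt(2). Each eigenvalue divided by 2 is the
    proportion of variance explained by that component."""
    r = correlation(xs, ys)
    s = 1.0 / math.sqrt(2.0)
    return [(1.0 + r, (s, s)), (1.0 - r, (s, -s))]

heights = [1.60, 1.70, 1.75, 1.80, 1.90]
weights = [55.0, 68.0, 72.0, 80.0, 90.0]
(l1, _), (l2, _) = pca_two_attributes(heights, weights)
print(l1 / 2.0)  # the first PC explains nearly all of the variance
```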
(5) Data Analysis: Statistics provides a number of tools for data analysis some of which may be employed within
Data Mining. These include:
Measures of Association and Relationships between attributes, such as computation of expected
frequencies and construction of cross-tabulations, computation of chi-squared statistics of association,
presentation of scatterplots and computation of correlation coefficients. The interestingness of rules may be
assessed by considering measures of statistical significance [PIAT91].
Inferential Statistics for hypothesis testing, such as construction of confidence intervals, parametric and non-
parametric hypothesis tests for average values and for group comparisons.
Classification may be carried out using discriminant analysis (supervised) or cluster analysis (unsupervised).
6.3 Database Approaches - Set-Oriented Approaches: Set-oriented approaches to data mining attempt to employ facilities provided by present-day DBMSs to discover knowledge. This allows years of research into database performance enhancement to be exploited within Data Mining processes. However, SQL is very limited in what it can provide for Data Mining, and therefore techniques based solely on this approach have very limited applicability. These techniques have nevertheless shown that certain aspects of Data Mining can be performed efficiently within the DBMS, posing the challenge for researchers of investigating how data mining operations can be divided into DBMS operations and non-DBMS operations so as to make the most of both worlds.
6.4 Visualisation: Visualisation techniques are used within the discovery process at two levels. Firstly, visualising the
data enhances exploratory data analysis. Exploratory Data Analysis is useful for data pre-processing allowing the user
to identify outliers and data subsets of interest. Secondly, Visualisation may be used to make underlying patterns in the
database more visible. NETMAP, a commercially available Data Mining tool, derives most of its power from this pattern visualisation technique.
7. Knowledge Representation
7.1 Neural Networks (see section 6.1.1)
7.2 Decision Tree
Fig. 2: An example Decision Tree
ID3 [QUIN86] is probably the best known classification algorithm in machine learning that uses decision trees. ID3 belongs to the family of TDIDT algorithms (Top-Down Induction of Decision Trees) and has undergone a number of enhancements since its inception, e.g. ACLS [PATE83] and ASSISTANT [KONO84].
A Decision Tree is a tree based knowledge representation methodology used to represent classification rules. The leaf
nodes represent the class labels while other nodes represent the attributes associated with the objects being classified.
The branches of the tree represent each possible value of the attribute node from which they originate. Figure 2 shows
a typical decision tree [QUIN86].
Once the decision tree has been built using a training set of data it may be used to classify new objects. To do so we
start at the root node of the tree and follow the branches associated with the attribute values of the object until we
reach a leaf node representing the class of the object.
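The traversal is straightforward to sketch. The tree below is a dict encoding (the encoding is ours) of the well-known weather example from [QUIN86]: internal nodes test an attribute, branches carry its values, and leaves carry the class labels P and N.

```python
# Internal nodes are (attribute, {value: subtree}); leaves are class labels.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "N", "normal": "P"}),
    "overcast": "P",
    "rain": ("windy", {"true": "N", "false": "P"}),
})

def classify(node, obj):
    """Follow the branches matching the object's attribute values
    until a leaf (class label) is reached."""
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[obj[attribute]]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "high"}))  # N
print(classify(tree, {"outlook": "rain", "windy": "false"}))     # P
```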
Clearly, for a given training set of examples there is a large number of possible decision trees that could be generated. The basic idea is to pick the decision tree that would correctly classify the most unseen examples (which is the essence of the induction process). One way of doing this is to generate all the possible decision trees for the training set and pick the simplest one. Alternatively, the tree could be built in such a way that the final tree is the best. ID3 takes the latter approach, using an information-theoretic measure of the 'gain in information' obtained by using a particular attribute as a node to decide on the attribute for each node.
Though decision trees have been used successfully in a number of algorithms, they have a number of disadvantages. Firstly, even for small training sets decision trees can be quite large and thus opaque. Quinlan [QUIN87] points out that it is questionable whether opaque structures like decision trees can be described as knowledge, no matter how well they function. Secondly, in the presence of missing values for attributes of objects in the test data set, trees can have a problem with performance. Also, the order of attributes in the tree nodes can have an adverse effect on performance.
The main advantage of decision trees is their execution efficiency, mainly due to their simple and economical representation, and their ability to perform even though they lack the expressive power of semantic networks or other first-order logic methods of knowledge representation.
7.3 Rules
Rules are probably the most common form of knowledge representation. A rule is a conditional statement that specifies an
action for a certain set of conditions, normally written X → Y. The action, Y, is normally called the
consequent of the rule and the set of conditions, X, its antecedent. A set of rules is an unstructured group of
IF...THEN statements.
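As a minimal sketch, a rule X → Y and its firing test might be represented as follows; the `Rule` class and the attribute names are hypothetical, chosen only for illustration:

```python
# A rule X -> Y as a set of conditions (the antecedent) plus a consequent.

class Rule:
    def __init__(self, antecedent, consequent):
        self.antecedent = antecedent   # dict of attribute -> required value
        self.consequent = consequent   # the action / conclusion Y

    def fires(self, facts):
        """The rule fires when every condition in X holds among the facts."""
        return all(facts.get(a) == v for a, v in self.antecedent.items())

rule = Rule({"age_band": "18-25", "claims": "none"}, "offer_discount")
print(rule.fires({"age_band": "18-25", "claims": "none"}))  # True
print(rule.fires({"age_band": "18-25", "claims": "two"}))   # False
```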
The popularity of rules as a method of knowledge representation is mainly due to their simple form. They are easily
interpreted by humans as they are a very intuitive and natural way of representing knowledge, unlike decision trees
and neural networks. Also as a system of rules is unstructured, it is less rigid, which can be advantageous at the early
stages of the development of a knowledge based system.
But representing knowledge as rules has a number of disadvantages. Rules are unstructured, and their
format is inadequate for representing many types of knowledge, e.g. causal knowledge. As the number of rules in the
system increases, the performance of the system decreases and the system becomes more difficult to maintain and
modify. New rules cannot be added arbitrarily to the system, as they may contradict existing rules and lead
to erroneous conclusions. The degradation in performance of a rule-based system is not graceful.
The lack of structure in rule based representations makes the modelling of the real-world difficult if not impossible.
Thus, a more organized and structured representation for knowledge is desirable that can make partial inferences and
degrade gracefully with size.
7.4 Frames
A frame is a template for holding clusters of related knowledge about a particular, narrow subject, which often
gives the frame its name. This clustering of related knowledge is a more natural way of representing knowledge as a
model of the real world.
Each frame consists of a number of slots that contain attributes, rules, hypotheses and graphical information related to
the object represented by the frame. These slots may be frames in their own right, giving the frames a hierarchical
structure. Relationships between frames are taxonomic and therefore a frame inherits properties of its 'parent frames'.
Thus, if the required information is not contained in a frame the next frame up in the hierarchy is searched.
Due to the structuring of knowledge in frames, representing knowledge in frames is more complex than a rule-based representation.
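The slot lookup that climbs the frame hierarchy when a frame lacks the required information can be sketched as follows; the frame names and slots are invented for illustration:

```python
class Frame:
    def __init__(self, name, parent=None, **slots):
        self.name, self.parent, self.slots = name, parent, slots

    def get(self, slot):
        """Look in this frame; if absent, search the next frame up the hierarchy."""
        if slot in self.slots:
            return self.slots[slot]
        if self.parent is not None:
            return self.parent.get(slot)   # taxonomic inheritance from parent frames
        return None

vehicle = Frame("vehicle", wheels=4, powered=True)
car = Frame("car", parent=vehicle, doors=5)
print(car.get("doors"))   # 5 (own slot)
print(car.get("wheels"))  # 4 (inherited from 'vehicle')
```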
8. Related Technologies
8.1 Data Warehousing - "to manage the data that analyses your business"
On-line Transaction Processing (OLTP) systems are inherently inappropriate for decision support querying and,
therefore, the need for Data Warehousing. A data warehouse is a Relational Database Management System designed
to provide for the needs of decision makers rather than the needs of transaction processing systems. Thus, a data
warehouse provides data in a form suitable for business decision support. More specifically, a Data Warehouse
allows
Any business questions to be asked
Any data in the enterprise to be included in the analysis
Interactive analysis with uninhibited performance, so that the decision-making process is not
slowed down in any way
A Data Warehouse brings together large volumes of business information obtained from transaction processing and
legacy operational systems. The information is cleansed and transformed so that it is complete and reliable and it is
stored over time so that trends can be identified. Data Warehouses are normally employed in one of two roles:
A provider of business relevant information to managers and analysts
A "closed-loop" system performing information driven functions such as intelligent inventory reordering
OLTP systems are designed to capture, store and manage day-to-day operations. The raw data collected in OLTP
systems exists in a number of different formats, such as hierarchical databases, flat files and COBOL datasets in
legacy systems, keeping it out of reach of business decision makers:
Ad-hoc queries and reports can take days
The nature of tuning an OLTP application makes rapid retrieval for business analysis impossible
SQL queries cannot deliver the correlated information needed by business analysts
Typically Data Warehousing software has to deal with large updates within narrow batch windows. Getting the data into the warehouse and fully preparing it for use is the key update operation. Therefore, a warehouse must provide
for the following requirements:
Data must be read from a number of different feeds e.g. disk files, magnetic tapes
Data must be converted to the database internal format from a variety of formats
Data must be filtered to reject invalid values
Records must be reorganised to match the relational schema
Records must be checked against the existing database to ensure global consistency and referential integrity
(inter table references)
Records must be written to physical storage observing requirements of data segmentation, physical device
placement
Records must be fully and richly indexed
System metadata must be updated
Heterogeneity between data sources must be resolved
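The core of the load steps above (convert from the feed format, filter invalid values, then load or reject) can be sketched as a toy pipeline; the flat-file layout, field names and validity test are assumptions for illustration, not any particular product's loader:

```python
def load_records(raw_feed, valid=lambda r: r.get("amount", -1) >= 0):
    """Hedged sketch of a warehouse load step: convert, filter, reorganise."""
    loaded, rejected = [], []
    for line in raw_feed:
        fields = line.strip().split("|")               # convert from a flat-file feed
        record = {"id": fields[0], "amount": float(fields[1])}
        if valid(record):
            loaded.append(record)                      # would be written and indexed
        else:
            rejected.append(record)                    # invalid values are rejected
    return loaded, rejected

feed = ["c1|120.50", "c2|-5.00", "c3|33.00"]
loaded, rejected = load_records(feed)
print(len(loaded), len(rejected))  # 2 1
```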
Additionally, a Data Warehouse must:
Provide mechanisms to continuously guarantee overall data quality
Not have any architectural limits - They must be able to handle terabytes of data.
Require minimal Storage Management activities - no reorganisation should be required. For other Management
activities, they must be modular and parallelisable
Tolerate hardware failures, continuing to make unaffected parts of the database available, since large
databases depend on large amounts of hardware
Provide query performance dependent only on the complexity of the query, not on the size of the database
A Data Warehouse presents a dimensional view of the data. Users expect to view this data from different
perspectives or dimensions; such functionality is provided by On-line Analytical Processing. Two general approaches
have emerged to meet this requirement of OLAP:
Self-contained Multi-dimensional Databases (MDDB): Contain summaries and rich tools for exploring this
data. When the MDDB user needs to "drill down", the underlying database is used
Dimensional tools layered above the database.
Red Brick Warehouse VPT (Very Large Data Warehouse Support, Parallel Query Processing, Time-Based
Data Management)
Consists of three components:
A Database Server supporting SQL plus decision support extensions (RISQL - Red Brick Intelligent SQL):
Specialised Indexes designed solely for retrieval
B-TREE
STAR: Automatically created when tables are created. Join processing is greatly enhanced using
these indexes as they maintain relationships between primary keys and foreign keys
PATTERN: These are fully-inverted text indexes that reduce search time for partial character
string matching
Powerful extensions to SQL
Business analysis functions that perform sequential calculations
Numeric and string functions to manipulate character strings and numeric values
Macro-building capabilities that simplify the use of repetitive SQL and calculations
Standard SQL is a set-oriented language: all its operations work on unordered sets of data.
This does not allow SQL to answer many useful business questions, e.g. moving averages, ranking and
n-tile ordering, and it does not provide even basic statistical functionality.
Example query: What are the top ten products sold during the second quarter of 1993 and what were their rankings
by dollars and units sold?
A High Performance load subsystem called the Table Management Utility (TMU)
Provides data loading and index-building facilities with performance necessary in a data warehouse
Ability to transform OLTP data to a more appropriate form for business data analysis i.e. the warehouse
schema
Gateway technologies supporting client/server access to the warehouse: allow both terminal and
client/server access to the warehouse
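The kind of ranking query that standard SQL of the time could not express directly is easy to state procedurally. This sketch uses invented sales figures and shows a top-three rather than a top-ten ranking:

```python
# Hedged sketch of a ranking computation: products ordered by dollars sold,
# each paired with its rank. The data is invented for illustration.
sales = [("cola", 9200), ("root beer", 4100), ("crisps", 7800),
         ("water", 1500), ("juice", 6300)]

ranked = sorted(sales, key=lambda s: s[1], reverse=True)
top3 = [(rank, name, dollars)
        for rank, (name, dollars) in enumerate(ranked[:3], start=1)]
print(top3)  # [(1, 'cola', 9200), (2, 'crisps', 7800), (3, 'juice', 6300)]
```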
8.2 On-line Analytical Processing
Today a business enterprise can prosper or fail based on the availability of accurate and timely information. The
Relational Model was introduced in 1970 by E. F. Codd in an attempt to allow the storage and manipulation of large
data sets about the day-to-day working of a business. However, just storing the data is not enough. Techniques are
required to help the analyst infer information from this data that can be used by the business enterprise to give it an
edge over the competition.
Relational Database Management System (RDBMS) products are not very good at transforming the data stored in OLTP
applications into useful information. A number of reasons contribute to why RDBMSs are not suitable for business analysis:
End-users want a multi-dimensional, business-oriented view of the data, not details of which columns are indexed
or which are the primary and foreign keys etc.
To make OLTP queries fast RDBMS applications generally normalise the data into 50 - 200 tables. Though
great for OLTP operations this is a nightmare for OLAP applications as it means a large number of joins are
required to access the data required for OLAP.
While parallel processing can be useful in table scans it offers very little performance enhancement for complex
joins
SQL is not designed with common business needs in mind. There is no way of using SQL for retrieving
information like "the top 10 salespersons", "bottom 20% of customers", "products with a market share of
greater than 25%" or "the sales ratio of cola to root beer".
Do not provide common data analysis tools like data rotation, drill downs, dicing and slicing
To allow truly ad-hoc end-user analysis, the Database Administrator must index the database on every possible
combination of columns and tables that the end user may ask for. This would create an unnecessary overhead
for OLTP and query response times.
Locking models, data consistency schemes and caching algorithms are based on the assumption that the RDBMS will be
used for OLTP applications, where transactions are small and discrete. Long-running, complex queries cause
problems in each of these areas.
Complex statistical functionality was never intended to be provided along with the RDBMS. Providing such
functionality was left to the user-friendly end-user products like spreadsheets that were to act as front ends to the
RDBMS. Though spreadsheets have provided a certain amount of functionality required by business analysts none
address the need for analysing the data according to its multiple dimensions.
Any product that intends to provide such functionality to business analysts must allow the following:
access to many different types of files
creation of multi-dimensional views of the data
experimentation with various data formats and aggregations
definition and animation of new information models
application of summations and other formulae to these models
Drilling down, rolling up, slicing and dicing, rotation of consolidation paths
Generation of a wide variety of reports
On-line Analytical Processing (OLAP) is the name given, by E. F. Codd (1993), to the technologies that attempt to
address these user requirements. Codd defined OLAP as "the dynamic synthesis, analysis and consolidation of large
volumes of multi-dimensional data". Codd provided 12 rules/ requirements of any OLAP system. These were:
Multi-dimensional Conceptual View
Transparency
Accessibility
Consistent Reporting Performance
Client-Server Architecture
Generic Sparse Matrix Handling
Multi User Support
Unrestricted Cross-dimensional Operations
Intuitive Data Manipulation
Flexible Reporting
Unlimited Dimensions and Aggregation Levels
Nigel Pendse provided another definition of OLAP which, unlike Codd's definition, does not mix technology
prescriptions with application requirements. He defined OLAP as "Fast Analysis of Shared Multidimensional
Information". Within the definition:
FAST means that the system should provide answers to most queries within five seconds, with only the very
complex queries taking a maximum of twenty seconds. This is so that the analyst does not lose his/her chain
of thought due to delayed responses from the system.
ANALYSIS means that the system can cope with the business logic and statistical analysis relevant to the
user's application. The analysis functions should be provided to the user in an intuitive way.
SHARED means that the system services multiple users concurrently.
MULTIDIMENSIONAL (a key OLAP requirement) means that the system must provide a multidimensional
conceptual view of the data, including support for multiple hierarchies.
INFORMATION is the data and derived information required by the user application.
9. State of the Art and Limitations of Commercially Available Products
9.1 End-User Products
Most end-user products available in the market do not address many of the aspects of data mining enumerated in
section 3. In fact, these packages are really machine learning packages with added facilities for accessing databases.
Having said that, they are powerful tools that can be very useful given a clean data set. However, clean data sets are
rarely found in real-world applications, and cleansing large data sets manually is not possible.
In this section we discuss two end-user products available in the UK. However, a much larger number of so called
"data mining tools" exist in the market.
9.1.1 CLEMENTINE
This package supplied by ISL Ltd., Basingstoke, England is a very easy to use package for "data mining". The
interface has been built with the intention of making it "as easy to use as a spreadsheet". CLEMENTINE uses a Visual
Programming Interface for building the discovery model and performing the learning tasks.
Accessible Data Sources: ASCII File format, Oracle, Informix, Sybase and Ingres.
Discovery Paradigms: Decision Tree Induction and Neural Network (Multi-Layer Perceptron).
Data Visualisation: Through interactive histograms, scatter plots, distribution graphs etc.
Data Manipulation: Sampling, Derive New Attributes, Filter Attributes
Hardware Platforms: Most Unix Workstations
Statistical Functionality: Correlations, Standard Deviation etc.
Deploying Applications: A Trained Neural Network or Decision Tree may be exported as C.
Data Mining Tasks Suitability: Classification problems with clean data sets available.
Consultancy: Available
9.1.2 DataEngine
This package is supplied by MIT GmbH, Germany. It also provides a Visual Programming interface.
Accessible Data Sources: ASCII File format, MS-Excel
Discovery Paradigms: Fuzzy Clustering, Fuzzy Rule Induction, Neural Network (Multi-Layered Perceptron,
Kohonen Self-Organising Map), Neuro-Fuzzy Classifier
Data Visualisation: Scatter and line plots, Bar charts and Area plots
Data Manipulation: Missing Data Handling, Selection, Scaling
Hardware Platforms: Most UNIX Workstations & Windows.
Statistical Functionality: Correlation, Linear Regression etc.
Deploying Applications: DataEngine ADL allows the integration of classifiers into other software Environments
Data Mining Tasks Suitability: Classification
Consultancy: Available
9.2 Consultancy Based Products
9.2.1 SGI
Silicon Graphics provide "Tailored Data Mining Solutions", which include hardware support in the form of the
CHALLENGE family of database servers and software support in the form of Data Warehousing and Data Mining
software. The CHALLENGE servers provide unique data visualisation capabilities, an area in which Silicon Graphics are
recognised as leaders. The interface provided by SGI allows you to "fly" through visual representations of your data,
allowing you to identify important patterns in your data and directing you to the next question you should ask within your analysis!
Apart from Visualisation, SGI provides facilities for Profile Generation and Mining for Association Rules.
9.2.2 IBM
IBM provide a number of tools to give users a powerful interface to Data Warehouses.
9.2.2.1 IBM Visualizer: Provides a powerful and comprehensive set of ready to use building blocks and
development tools that can support a wide range of end-user requirements for query, report writing, data analysis,
chart/ graph making and business planning.
9.2.2.2 Discovering Association Patterns: IBM's Data Mining group at Almaden pioneered research into efficient
techniques for discovering associations in buying patterns in supermarkets. Their algorithms have been successfully
employed in supermarkets in the USA to discover patterns in the supermarket data that could not have been
discovered without data mining.
9.2.2.3 Segmentation or Clustering: Data Segmentation is the process of separating data tuples into a number of
sub-groups based on similarities in their attribute values. IBM provides two solutions based on two different discovery
paradigms: Neural Segmentation and Binary Segmentation. Neural Segmentation is based on a Neural Network
technique called self-organising maps. Binary Segmentation was developed at IBM's European Centre for Applied
Mathematics. It is based on a technique called relational analysis, which was developed to deal with binary
data.
Applications: Basket Analysis
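As a generic sketch of segmentation (not IBM's relational analysis or self-organising maps), a minimal k-means-style grouping on a single numeric attribute might look like this; the spend figures and initial centres are invented:

```python
# Hedged sketch of data segmentation: tuples are assigned to the nearest
# group centre, and centres are recomputed until the grouping settles.
def segment(values, centres, iterations=10):
    groups = {}
    for _ in range(iterations):
        groups = {c: [] for c in centres}
        for v in values:
            nearest = min(centres, key=lambda c: abs(v - c))
            groups[nearest].append(v)
        centres = [sum(g) / len(g) for g in groups.values() if g]
    return centres, groups

spend = [10, 12, 11, 95, 102, 99]        # two obvious sub-groups of customers
centres, groups = segment(spend, centres=[0, 50])
print(sorted(round(c) for c in centres))  # [11, 99]
```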
10. Case Studies and Data Mining Success Stories
Case Study 1: Health Care Analysis using Health-KEFIR
KEFIR (Key Findings Reporter), developed by GTE Laboratories, is a domain-independent system for discovering and
explaining key findings, i.e. deviations.
KEFIR consists of four main parts:
Deviation Detection: Input to KEFIR:
Data
Predefined measures for deviation detection e.g. In Healthcare Analysis there are standard measures
used such as: Average_hospital_payments_per capita, Admission_rate_per_1000_people etc.
Predefined categories to create sub-populations (sector): The full population in question is the top sector.
Based on certain categories this population is sub-divided into sub-sectors recursively. e.g. The Inpatient
population, based on the category Admission Type, may be split into Inpatient Surgical, Inpatient Medical,
Inpatient Mental etc.
KEFIR starts by discovering deviations in the predefined measures for the top sector and then discovers deviations in
the sub-sectors that seem interesting.
Evaluation/ Ordering by Interestingness: Two factors are taken into account when ordering the deviations in
descending order of interestingness. These are:
Impact of deviation e.g. effect of deviation on payments for healthcare.
Probability of success of associated recommendation
These two factors together give a value for potential saving which is used to rank the deviations.
Statistical significance is another important factor e.g. the deviation may be due to just one very costly case -
chances are it will not occur again. Measures like standard deviation may be useful.
Explanation: An explanation for a deviation can be found in two ways:
Investigating the underlying formula: If the measure for which the deviation has been detected has a
formula associated with it e.g. Total Payments = Pay_per_day*no_of_days - an explanation may be
found by investigating the deviation in the component measures of the formula.
Investigating the sub-populations: An explanation may be found by looking at which subpopulation(s) is
causing the deviation to occur.
Recommendation: Expert-system based, provided by the domain expert. The output is a report including
business graphics.
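The ranking by potential saving described under "Evaluation/Ordering by Interestingness" can be sketched as follows; the measures and figures are invented for illustration:

```python
# Hedged sketch of KEFIR-style ranking: potential saving = impact of the
# deviation x probability that the associated recommendation succeeds.
deviations = [
    {"measure": "admission_rate", "impact": 120000, "p_success": 0.3},
    {"measure": "payments_per_capita", "impact": 80000, "p_success": 0.8},
    {"measure": "length_of_stay", "impact": 20000, "p_success": 0.9},
]
for d in deviations:
    d["potential_saving"] = d["impact"] * d["p_success"]

# Deviations are reported in descending order of potential saving.
ranked = sorted(deviations, key=lambda d: d["potential_saving"], reverse=True)
print([d["measure"] for d in ranked])
# ['payments_per_capita', 'admission_rate', 'length_of_stay']
```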
KEFIR needs to be tailored to the specific domain. The healthcare application resulted in Health-KEFIR, a version specific to healthcare. The success of Health-KEFIR shows that Data Mining solutions need to be tailored to the
specific domain, and that specific solutions to mining tasks are more realistic than multi-purpose solutions.
Case Study 2: Mass Appraisal for Property Taxation
The data set consisted of 416 data items; one in three items was selected as the holdout sample.
For each property the following variables were used as input to the Neural Network:
Ward
Transaction Date (Month and Year)
Size (Area)
Number of Bedrooms
House Class (Detached or Semi-Detached)
Age
House Type (Bungalow, House, CH, FH)
Heating Type (OFCH, FGCH, FSCH, PSCH, PECH, PGCH, GLHO, None, Not Known)
Garage (Drive, Single, None)
The goal was to predict the house price based on the other attributes of the houses. We used a Neural Network for
the prediction and achieved an accuracy of approximately 82% with a mean error of approximately 15%. We also
explored using a Rule Induction approach. The accuracy achieved by the Neural Network was greater than that
achieved through rule induction.
Using simple data visualisation outliers in the data were spotted and removed from the data. The neural network was
trained on the remaining dataset. The predictive accuracy of the network improved from 82% to approximately 93%
with the mean error reducing to approximately 7.8%.
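The outlier-screening step can be sketched with a simple standard-deviation filter; the threshold of two standard deviations and the price data are illustrative assumptions, not the procedure actually used in the study:

```python
import statistics

def remove_outliers(prices, k=2.0):
    """Drop values more than k standard deviations from the mean."""
    mean = statistics.mean(prices)
    sd = statistics.pstdev(prices)
    return [p for p in prices if abs(p - mean) <= k * sd]

prices = [52000, 55000, 48000, 61000, 50000, 950000]  # one obvious outlier
clean = remove_outliers(prices)
print(len(prices), len(clean))  # 6 5
```

In practice the study used data visualisation rather than an automatic rule to spot outliers, but the effect on the training set is the same: the model is trained only on the retained values.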
Case Study 3: Policy Lapse/Renewal Prediction
A Large Insurance Company had collected a large amount of data on its motor insurance customers. Using this data,
the company wanted to be able to predict in advance which policies were going to lapse and which ones were going
to be renewed as they neared expiry. The advantage of such a prediction is obvious. As competition increases in the business world, customer retention has become an important issue. The insurance company wanted to target new
services at their customers to woo them into renewing their policies rather than moving to other insurance companies.
Clearly, rather than targeting customers who are going to renew their policy anyway, it is better to target
customers who are more likely to let their policy lapse. Data Mining provided the solution for the company.
Of the 34 different attributes stored for each policy we picked 12 that seemed to have an effect on whether
or not the policy would lapse. Using a neural network we achieved a predictive accuracy of 71%, and
using rule-induction-based techniques we were able to identify the attributes of an insurance policy that was going to lapse.
The whole exercise took approximately 3 weeks to complete. The accuracy achieved by Data Mining equalled the
accuracy achieved by the company's statisticians, who undoubtedly spent many more man-months on their statistical
models.
11. The Mining Kernel System: The University of Ulster's perspective
The Mining Kernel System (MKS) being developed in the authors' laboratory embodies the inter-disciplinary nature
of Data Mining by providing facilities from Statistics, Machine Learning, Database Technology and Artificial
Intelligence. The functionality provided by MKS forms a strong foundation for building the powerful Data Mining
implementations required to tackle what is clearly recognised to be a complex problem.
MKS is a set of libraries implemented by the authors to remove the mundane, repetitive tasks like data access,
knowledge representation and statistical functionality from Data Mining so as to allow the user to concentrate on the
more complex aspects of the discovery algorithms being implemented. In this section we describe the motivation
behind each of the libraries within MKS along with brief descriptions of these libraries.
Figure 2 shows the architecture of MKS.
Figure 2: The Mining Kernel System (MKS)
At present MKS has 7 libraries providing Data Mining facilities. The facilities provided by MKS may be split into two
main modules: The Interface Module and the Mining Module.
The Interface Module provides facilities for Mining algorithms to interact with the environment. MKS has three distinct
interfaces to the outer world - the User using the User Interface, the Data Interface using the VDL Library and the
Knowledge Interface using the KIO Library. Information provided by the user using the user interface include the data
view in the form of the Data Source Mapping file (see section 2.1.2), Domain Knowledge, Syntactic Constraints e.g.
antecedent attributes of interest and Support, Uncertainty and Interestingness thresholds.
The Mining module provides core facilities required by most mining algorithms. At present the Mining module of MKS
consists of 5 Libraries that provide facilities from Conventional Statistics, Machine Learning and Artificial Intelligence.
The libraries within this module of MKS are: the Statistical Library (STS), the Information Theoretic Library (INF),
the Knowledge Representation Library (KNR), the Set Handling Library (SET) and the Evidence Theory Library
(EVR).
References
[AGRA92] Agrawal R, Ghosh S, Imielinski T, Iyer B and Swami A, An interval classifier for database mining
applications, Proc of 18th Int'l Conf. on VLDB, pp 560-573, 1992.
[AGRA93] Agrawal R, Imielinski T and Swami A, Database mining: A performance perspective, IEEE Transactions
on Knowledge and Data Engineering, Special issue on Learning and Discovery in Knowledge-Based Databases,
1993.
[AGRA93a] Agrawal R, Imielinski T and Swami A, Mining association rules between sets of items in large databases,
Proc of the ACM SIGMOD Conf. on Management of Data, 1993.
[AGRA93b] R. Agrawal, C. Faloutsos, A. Swami. Efficient similarity search in sequence databases. Proc. of the 4th
International Conference on Foundations of Data Organisation and Algorithms, 1993.
[AGRA94] R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. Proc. of
VLDB94, Pg. 487 - 499, 1994.
[AGRA95] R. Srikant, R. Agrawal. Mining Generalized Association Rules. Proc. of VLDB95, 1995.
[AGRA95a] Fast Similarity Search in the Presence of Noise, Scaling and Translation in Time-Series Databases. Proc.
of VLDB95, 1995.
[ANAN95] S.S. Anand, D.A. Bell and J.G. Hughes, A General Framework for Data Mining Based on Evidence
Theory, Provisionally accepted by Data and Knowledge Engineering Journal.
[BELL93] D. A. Bell, From Data Properties to Evidence, IEEE Transactions on Knowledge and Data Engineering,
Vol. 5, No. 6, Special Issue on Learning and Discovery in Knowledge - Based Databases, December, 1993.
[BERN94] D. J. Berndt, J. Clifford. Using dynamic time warping to find patterns in time series. KDD94: AAAI
Workshop on Knowledge Discovery in Databases, Pg. 359 - 370, July, 1994.
[CHAN91] K. C. C. Chan, A. K. C. Wong. A Statistical Technique for Extracting Classificatory Knowledge from
Databases. Knowledge Discovery in Databases, Pg. 107 - 124, AAAI/MIT Press 1991.
[FRAW91] W. J. Frawley, G. Piatetsky-Shapiro and C. J. Matheus, Knowledge Discovery in Databases : An
Overview Knowledge Discovery in Databases, Pg. 1 - 27, AAAI/MIT Press 1991.
[HAN94] J. Han. Towards efficient induction mechanisms in database systems. Theoretical Computer Science 133,
Pg. 361 - 385, 1994.
[IMAM93] I. F. Imam, R. S. Michalski and L. Kershberg, Discovering Attribute Dependencies in Databases by
Integrating Symbolic Learning and Statistical Techniques, Working Notes of the Workshop in Knowledge Discovery
in Databases, AAAI-93, 1993.
[LU95] H. Lu, R. Setiono, H. Liu. NeuroRule: A Connectionist Approach to Data Mining. Proc. of VLDB95.
[PEND91] E. P. D. Pednault. Minimal-Length Encoding and Inductive Inference. Knowledge Discovery in Databases,
Pg. 71 - 92, AAAI/MIT Press 1991.
[SMYT91] P. Smyth and R. M. Goodman, Rule Induction Using Information Theory, Knowledge Discovery in
Databases, Pg. 159 - 176, AAAI/MIT Press 1991.
[WUQ91] Q. Wu, P. Suetens and A. Oosterlinck, Integration of Heuristic and Bayesian Approaches in a Pattern -
Classification System, Knowledge Discovery in Databases, Pg. 249 - 260, AAAI/MIT Press 1991.
[YAGE91] R. R. Yager, On Linguistic Summaries of Data, Knowledge Discovery in Databases, Pg. 347 - 366,
AAAI/MIT Press 1991.
[ZIAR91] W. Ziarko, The Discovery, Analysis and Representation of Data Dependencies in Databases, Knowledge
Discovery in Databases, Pg. 195 - 212, AAAI/MIT Press 1991.