Data Mining (DM)





Advanced Technology for Knowledge Management

Data Mining : The Discovery Technology for Knowledge Management

Yike Guo

Dept. of Computing, Imperial College


Course Overview

• Goal
– Basic Concepts of Data Mining
– Basic Data Mining Techniques
– Data Mining Procedure in Real-World Applications
– Future Research Trends in Data Mining
• Reference Books
– Advances in Knowledge Discovery and Data Mining, U. M. Fayyad and G. Piatetsky-Shapiro, AAAI/MIT Press, 1996
– Predictive Data Mining: A Practical Guide, Sholom M. Weiss and Nitin Indurkhya, Morgan Kaufmann Publishers, Inc., 1997
– Data Mining Techniques, Wiley Computer Publishing, 1997


What does the data say?

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No


Turning Data into Knowledge


Data Mining

[Figure: Data Mining at the intersection of Machine Learning, Statistics, Databases, and High Performance & Distributed Computing; further labels: Enabling Technology, Decision Support, Knowledge Discovery.]


Why Data Mining

• Limitation of traditional database querying:
– Most queries of interest to data owners are difficult to state in a query language
  • "find me all records indicating fraud" => "tell me the characteristics of fraud" (summarisation)
  • "find me who is likely to buy product X" (classification problem)
  • "find all records that are similar to records in table X" (clustering problem)
– Supporting analysis and decision making with traditional (SQL) queries becomes infeasible (the query formulation problem).


Relational Database Revisited
• Terabyte databases, consisting of billions of records, are becoming common
• The relational data model is the de facto standard
• A relational database: a set of relations
• A relation: a set of homogeneous tuples
• Relations are created, updated and queried using SQL
• Query = keyword-based search

    SELECT telephone_number
    FROM telephone_book
    WHERE last_name = 'Smith';
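The query above can be tried against an in-memory SQLite database. The following is a minimal sketch only: the table layout and the rows are made-up illustration data, and the helper name `find_numbers` is hypothetical.

```python
import sqlite3


def find_numbers(last_name):
    """Run the slide's keyword-style query against a tiny in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE telephone_book (last_name TEXT, telephone_number TEXT)"
    )
    # Hypothetical rows, for illustration only.
    conn.executemany(
        "INSERT INTO telephone_book VALUES (?, ?)",
        [("Smith", "555-0100"), ("Jones", "555-0101"), ("Smith", "555-0102")],
    )
    rows = conn.execute(
        "SELECT telephone_number FROM telephone_book "
        "WHERE last_name = ? ORDER BY telephone_number",
        (last_name,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(find_numbers("Smith"))
```

The query can only return rows that match a stated condition; it cannot say what the matching rows have in common.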


SQL : Relational Querying Language

• Provides a well-defined set of operations: scan, join, insert, delete, sort, aggregate, union, difference

• Scan -- applies a predicate P to relation R
    For each tuple tr from R
        if P(tr) is true, tr is inserted in the output stream

• Join -- composes two relations R and S
    For each tuple tr from R
        For each tuple ts from S
            if the join attribute of tr equals the join attribute of ts
                form an output tuple by concatenating tr and ts
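The scan and join loops can be written almost literally in Python. This is a sketch under assumptions: relations are modelled as lists of dicts, and the relation and attribute names (`people`, `phones`, `person_id`, `owner_id`) are hypothetical. Real database engines use far more efficient join strategies than this nested loop.

```python
def scan(relation, predicate):
    """Yield every tuple of `relation` for which `predicate` holds."""
    for tr in relation:
        if predicate(tr):
            yield tr


def nested_loop_join(r, s, r_attr, s_attr):
    """Join r and s on r_attr == s_attr by comparing every pair of tuples."""
    for tr in r:
        for ts in s:
            if tr[r_attr] == ts[s_attr]:
                # Concatenate the two tuples; on a name clash ts wins.
                yield {**tr, **ts}


# Hypothetical relations, for illustration only.
people = [
    {"person_id": 1, "last_name": "Smith"},
    {"person_id": 2, "last_name": "Jones"},
]
phones = [
    {"owner_id": 1, "telephone_number": "555-0100"},
    {"owner_id": 1, "telephone_number": "555-0102"},
]

smiths = list(scan(people, lambda t: t["last_name"] == "Smith"))
joined = list(nested_loop_join(people, phones, "person_id", "owner_id"))
```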


The Query Formulation Problem

• It is not solvable via query optimisation
• It has not received much attention in the database field or in traditional statistical approaches
• These problems are inductive in nature: learning from data rather than searching the data
• The natural solution is a train-by-example approach that constructs inductive models as the answers

Consider the query:

What kinds of weather conditions are suitable for playing tennis?
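One way to approach such a question inductively is to ask which attribute of the 14-day weather table best separates the "Yes" days from the "No" days. The sketch below uses information gain, the split criterion of decision-tree learners such as ID3. That choice is an assumption made for illustration; the slides do not prescribe a particular method.

```python
from collections import Counter
from math import log2

COLUMNS = ("Outlook", "Temperature", "Humidity", "Wind", "Play")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, col):
    """Reduction in label entropy obtained by splitting on column `col`."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {COLUMNS[i]: information_gain(ROWS, i) for i in range(4)}
best = max(gains, key=gains.get)
print(best, round(gains[best], 3))
```

On this table the most informative attribute is Outlook: every Overcast day is a "Yes" day, which is the kind of regularity a selection query cannot surface on its own.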


Why Data Mining Now
• Data Explosion
– Business data: organisations such as supermarket chains, credit card companies, investment banks, government agencies, etc. routinely generate daily volumes of 100MB of data.
– Scientific data: scientific and remote sensing instruments collect data at rates of gigabytes per day, far beyond human analysis abilities.
• Data Wasting
– Only a small portion (5% - 10%) of the collected data is ever analysed.
– Data that may never be analysed continues to be collected, at great expense.
• We are drowning in data, but starving for knowledge!


What is Data Mining

Data Mining: a non-trivial data analysis process for identifying valid, useful and understandable patterns from databases.


• Data: set of facts F ( records in a database)

• Pattern: an expression E in a language L describing the data in a subset F_E of F, where E is simpler than the enumeration of all the facts in F_E. F_E is also called a class, and E is also called a model or knowledge.

• Data Mining Process: data mining is a multi-step process involving multiple choices, iteration and evaluation. It is non-trivial since there is no closed-form solution. It always involves intensive search.

• Validity : E is true (with high probability) for F

• Useful : patterns are not trivial inductive properties of data

• Understandable: patterns should be understandable by data owners to aid in understanding the data/domain
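These definitions can be made concrete with the 14-day weather table. In the sketch below a pattern is read as a simple rule "Outlook = x implies Play Tennis = y", F_E is the set of records with that Outlook value, and validity is estimated as the fraction of F_E for which the rule holds. This reading of "true with high probability" is an assumption made for illustration, and the function names are hypothetical.

```python
# (Outlook, Play Tennis) pairs taken from the 14-day weather table.
FACTS = [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
    ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rain", "No"),
]


def covered(facts, outlook):
    """F_E: the subset of facts that the pattern describes."""
    return [f for f in facts if f[0] == outlook]


def validity(facts, outlook, play):
    """Fraction of F_E for which the pattern's conclusion holds."""
    subset = covered(facts, outlook)
    return sum(1 for f in subset if f[1] == play) / len(subset)


overcast_validity = validity(FACTS, "Overcast", "Yes")  # rule holds on every Overcast day
sunny_validity = validity(FACTS, "Sunny", "No")         # rule holds on 3 of 5 Sunny days
```

The rule for Overcast days is a single short expression that stands in for four enumerated records, which is the sense in which a pattern is simpler than the facts it describes.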


How Data Mining Works

[Figure: diagram with the labels Historical Data (Data Warehouse), Predictive, Operational Data, Business Action, Data Mining System, and Decision Support System.]


Data Warehousing

• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” --- W. H. Inmon

• A data warehouse is
– A decision support database that is maintained separately from the organization’s operational databases.
– It integrates data from multiple heterogeneous sources to support the continuing need for structured and/or ad-hoc queries, analytical reporting, and decision support.


Modeling Data Warehouses

• Modeling data warehouses: dimensions & measurements

– Star schema: A single object (fact table) in the middle connected radially to a number of objects (dimension tables).

– Snowflake schema: A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables.

– Fact constellations: Multiple fact tables share dimension tables.

• Storage of selected summary tables:

– Independent summary table storing pre-aggregated data, e.g., total sales by product by year.

– Encoding aggregated tuples in the same fact table and the same dimension tables.
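A star schema and a pre-aggregated summary can be sketched with an in-memory SQLite database. The table names, columns, and rows below are hypothetical illustration data, not taken from the slides. The query computes the "total sales by product by year" summary by joining the fact table to two dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per time period / product.
    CREATE TABLE time_dim    (time_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact table in the middle: foreign keys to each dimension plus a measure.
    CREATE TABLE sales_fact (
        time_id    INTEGER REFERENCES time_dim(time_id),
        product_id INTEGER REFERENCES product_dim(product_id),
        amount     REAL
    );
    INSERT INTO time_dim VALUES (1, 1996, 'Q1'), (2, 1996, 'Q2'), (3, 1997, 'Q1');
    INSERT INTO product_dim VALUES (1, 'audio'), (2, 'video');
    INSERT INTO sales_fact VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 1, 70.0), (1, 2, 30.0);
""")

# Summary table content: total sales by product by year.
summary = conn.execute("""
    SELECT p.name, t.year, SUM(f.amount)
    FROM sales_fact AS f
    JOIN time_dim    AS t ON f.time_id = t.time_id
    JOIN product_dim AS p ON f.product_id = p.product_id
    GROUP BY p.name, t.year
    ORDER BY p.name, t.year
""").fetchall()
conn.close()
print(summary)
```

In a snowflake variant, a dimension such as `product_dim` would itself be split into further normalized tables (for example a separate product-category table).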


Example of Star Schema

[Figure: a central Sales Fact Table linked to four dimension tables (Time, Store, Product, Location), each with many attributes.]


Example of a Snowflake Schema

[Figure: a central Sales Fact Table linked to the Time and Store dimension tables (each with many attributes) and to the Product and Location dimension tables.]








A Star-Net Query Model

[Figure: star-net diagram; surviving labels: Shipping Method, Customer Orders.]

















View of Warehouses and Hierarchies

• Importing data

• Table Browsing

• Dimension creation

• Dimension browsing

• Cube building

• Cube browsing


Construction of Data Cubes


[Figure: data cube with an amount dimension split into ranges (0-20K, 20-40K, ..., 60K-, sum); surviving labels: All Amount, Comp_Method, B.C.]

• Each dimension contains a hierarchy of values for one attribute.
• A cube cell stores aggregate values, e.g., count, sum, max, etc.
• A "sum" cell stores dimension summation values.
• Sparse-cube technology and MOLAP/ROLAP integration.
• "Chunk"-based multi-way aggregation and single-pass computation.
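The idea of cube cells and "sum" cells can be illustrated with a naive in-memory cube. This sketch assumes flat dimensions without hierarchies, uses hypothetical sales records, and is not the chunk-based multi-way aggregation mentioned above: it simply adds each record to every cell it belongs to.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # marker for a "sum" cell along one dimension


def build_cube(rows, n_dims):
    """Sum the measure for every cell, including the ALL (sum) cells."""
    cube = defaultdict(float)
    for *dims, measure in rows:
        # Each record contributes to 2**n_dims cells: for every dimension,
        # either its own value or the rolled-up value ALL.
        for mask in product((False, True), repeat=n_dims):
            cell = tuple(ALL if rolled else d for d, rolled in zip(dims, mask))
            cube[cell] += measure
    return dict(cube)


# Hypothetical records: (product, city, sales amount).
rows = [
    ("audio", "London", 100.0),
    ("audio", "Paris", 50.0),
    ("video", "London", 30.0),
]
cube = build_cube(rows, 2)
```

Only non-empty cells are stored, so the combination ("video", "Paris") does not appear at all; this is a very simple form of the sparse storage the slide refers to.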


OLAP: On-Line Analytical Processing

• A multidimensional, LOGICAL view of the data.

• Interactive analysis of the data: drill, pivot, slice and dice, filter.

• Summarization and aggregations at every dimension intersection.

• Retrieval and display of data in 2-D or 3-D crosstabs, charts, and graphs, with easy pivoting of the axes.

• Analytical modeling: deriving ratios, variance, etc. and involving measurements or numerical data across many dimensions.

• Forecasting, trend analysis, and statistical analysis.

• Requirement: Quick response to OLAP queries.
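Two of the interactive operations listed above, slicing and pivoting, can be sketched on a handful of records. The data, the dimension names, and the helper names `slice_` and `crosstab` are hypothetical; an OLAP server would answer such requests from pre-computed aggregates rather than by rescanning the records.

```python
from collections import defaultdict

# Hypothetical sales records: (product, region, year, amount).
SALES = [
    ("audio", "North", 1996, 100),
    ("audio", "South", 1996, 40),
    ("video", "North", 1996, 30),
    ("audio", "North", 1997, 70),
]
DIMS = {"product": 0, "region": 1, "year": 2}


def slice_(records, dim, value):
    """Slice: keep only the records with one dimension fixed to a value."""
    return [r for r in records if r[DIMS[dim]] == value]


def crosstab(records, row_dim, col_dim):
    """Total amount in a 2-D table keyed by (row value, column value)."""
    table = defaultdict(int)
    for r in records:
        table[(r[DIMS[row_dim]], r[DIMS[col_dim]])] += r[3]
    return dict(table)


sales_1996 = slice_(SALES, "year", 1996)
by_product_region = crosstab(sales_1996, "product", "region")
# Pivot: the same numbers with the axes swapped.
by_region_product = crosstab(sales_1996, "region", "product")
```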


OLAP Architecture
• Logical architecture:
– OLAP view: multidimensional and logical presentation of the data in the data warehouse/mart to the business user.
– Data store technology: the technology options for how and where the data is stored.
• Three service components:
– data store services,
– OLAP services, and
– user presentation services.
• Two data store architectures:
– Multidimensional data store: Multidimensional OLAP (MOLAP).
– Relational data store: Relational OLAP (ROLAP).


Dimension Browsing

[Screenshots: browsing the Product and Location dimensions.]


Decision Support with Data Warehouse

• Ad Hoc Queries: Q: How many customers do we have in London? A: 32776

• Report and Spreadsheet

• OLAP: Q: What are the sales figures for Y in the different regions?

• Statistics: Q: Is there a relation between age and buying behaviour? A: Older clients buy more

• Data Mining: Q: What factors influence buying behaviour?

A1: Young men in sports cars buy 3 times as much audio equipment (clustering/regression)

A2: Older women with dark hair more often buy rinse (classification)

A3: Buyers of cars are also the buyers of houses (association)

[Figure: plot labelled with age groups (Old, Middle, Young) and Hair color.]
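An association such as "buyers of cars are also the buyers of houses" is commonly quantified with support and confidence. The slides do not name these measures, so their use here is an assumption, and the customer records are invented purely for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical purchase histories, one set of purchased items per customer.
customers = [
    {"car", "house"},
    {"car", "house"},
    {"car"},
    {"house"},
    {"bike"},
]

car_house_support = support(customers, {"car", "house"})          # 2 of 5 customers
car_to_house_confidence = confidence(customers, {"car"}, {"house"})  # 2 of 3 car buyers
```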




Example Data Mining Applications
• Commercial:
– Fraud detection: identify fraudulent transactions
– Loan approval: establish the creditworthiness of a customer requesting a loan
– Investment analysis: predict a portfolio's return on investment
– Marketing and sales data analysis: identify potential customers; establish the effectiveness of a sales campaign
• Medical:
– Drug effect analysis: learn drug effects from patient records
– Disease causality analysis
• Political policy:
– Election policy: people’s voting patterns
– Social policy: tax/benefit policy
• Manufacturing:
– Manufacturing process analysis: identify the causes of manufacturing problems
– Experiment result analysis: summarise experiment results and create predictive models


• Scientific data analysis: cataloguing in surveys, basic processing needed before higher-level science analysis can occur, scientific discovery over large data sets.

[Figure: diagram linking Theory, Experiments, Simulation, Numerical Computing (Iterative Equation Solving), Data Assimilation (Data Warehousing), and Data Mining (Statistical Computing and Machine Learning).]

– Numerical Computing: simulating real-world systems based on the underlying theory
– Data Assimilation: comprehending, consolidating and warehousing the simulation/experiment data
– Data Mining: analysing the warehoused simulation/experiment data for knowledge discovery


Related Fields:

• Machine learning: Inductive reasoning

• Statistics : Sampling, Statistical Inference, Error Estimation

• Pattern recognition: Neural Networks, Clustering

• Knowledge Acquisition, Statistical Expert Systems

• Data Visualisation

• Databases: OLAP, Parallel DBMS, Deductive Databases

• Data Warehousing: collection and cleaning of transactional data for on-line retrieval
