
Web-Based Data Mining System


Page 1: Web-Based Data Mining System

C.-C. Chan
Department of Computer Science
University of Akron
Akron, OH 44325-4003
[email protected]

UA Faculty Forum 2008 by C.-C. Chan

Page 2: Web-Based Data Mining System

Outline
  • Overview of Data Mining
  • Software Tools
  • A Rule-Based System for Data Mining
  • Concluding Remarks

Page 3: Web-Based Data Mining System

Data Mining (KDD)
  • From Data to Knowledge
  • Process of KDD (Knowledge Discovery in Databases)
  • Related Technologies
  • Comparisons

Page 4: Web-Based Data Mining System

Why KDD?

"We are drowning in information, but starving for knowledge." (John Naisbitt)

Growing gap between data generation and data understanding:
  • Automation of business activities: telephone calls, credit card charges, medical tests, etc.
  • Earth observation satellites: estimated to generate one terabyte (10^12 bytes) of data per day, at a rate of one picture per second.
  • Biology: the Human Genome database project has collected gigabytes of data on the human genetic code [Fasman, Cuticchia, Kingsbury, 1994].
  • US Census data
  • NASA databases
  • …
  • World Wide Web

Page 5: Web-Based Data Mining System

Process of KDD

[1] Fayyad, U., Editorial, Int. J. of Data Mining and Knowledge Discovery, Vol. 1, Issue 1, 1997.
[2] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery: an overview," in Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), MIT Press, 1996.

Page 6: Web-Based Data Mining System

Process of KDD

1. Selection
     • Learning the application domain
     • Creating a target dataset
2. Pre-Processing
     • Data cleaning and preprocessing
3. Transformation
     • Data reduction and projection
4. Data Mining
     • Choosing the functions and algorithms of data mining
     • Association rules, classification rules, clustering rules
5. Interpretation and Evaluation
     • Validate and verify discovered patterns
6. Using discovered knowledge

Page 7: Web-Based Data Mining System

Typical Data Mining Tasks

Finding Association Rules [Rakesh Agrawal et al., 1993]
  • Each transaction is a set of items.
  • Given a set of transactions, an association rule is of the form X → Y, where X and Y are sets of items.
  • e.g.: 30% of transactions that contain beer also contain diapers; 2% of all transactions contain both of these items.
  • Applications:
      • Market basket analysis and cross-marketing
      • Catalog design
      • Store layout
      • Buying patterns
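The two percentages in the beer-and-diapers example are what the association-rule literature calls confidence (the 30%) and support (the 2%). The following is a minimal Python sketch of how both are computed for a rule X → Y. The transaction list is made up for illustration, and the helper names `support` and `confidence` are hypothetical, not part of any system described in the talk.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], items: FrozenSet[str]) -> float:
    """Fraction of all transactions that contain every item in `items`."""
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[Transaction],
               x: FrozenSet[str], y: FrozenSet[str]) -> float:
    """Fraction of the transactions containing X that also contain Y."""
    return support(transactions, x | y) / support(transactions, x)


# Illustrative data only (not from the talk).
transactions = [
    frozenset({"beer", "diapers", "chips"}),
    frozenset({"beer", "diapers"}),
    frozenset({"beer", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"milk", "diapers"}),
]
x, y = frozenset({"beer"}), frozenset({"diapers"})
rule_support = support(transactions, x | y)        # 2 of 5 transactions
rule_confidence = confidence(transactions, x, y)   # 2 of the 3 beer transactions
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule miner would typically keep only the rules whose support and confidence both exceed user-chosen thresholds.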

Page 8: Web-Based Data Mining System

Finding Sequential Patterns
  • Each data sequence is a list of transactions.
  • Find all sequential patterns with a user-specified minimum support.
  • e.g.: Consider a book-club database. A sequential pattern might be: 5% of customers bought "Harry Potter I", then "Harry Potter II", and then "Harry Potter III".
  • Applications:
      • Add-on sales
      • Customer satisfaction
      • Identify symptoms/diseases that precede certain diseases
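The book-club example can be made concrete with a small sketch of how the support of one sequential pattern is measured. It assumes each customer's data sequence is a time-ordered list of transactions (sets of items) and that a customer supports a pattern when the pattern's itemsets appear, in order, in distinct transactions. The data, the 5% threshold variable, and the function names are illustrative assumptions.

```python
from typing import List, Set

Sequence = List[Set[str]]  # one customer's transactions, in time order


def contains(sequence: Sequence, pattern: Sequence) -> bool:
    """True if the pattern's itemsets occur, in order, in distinct transactions."""
    pos = 0
    for transaction in sequence:
        if pos < len(pattern) and pattern[pos] <= transaction:
            pos += 1
    return pos == len(pattern)


def sequence_support(database: List[Sequence], pattern: Sequence) -> float:
    """Fraction of customers whose data sequence contains the pattern."""
    return sum(contains(s, pattern) for s in database) / len(database)


# Illustrative book-club data (not from the talk), one sequence per customer.
database = [
    [{"Harry Potter I"}, {"Harry Potter II"}, {"Harry Potter III"}],
    [{"Harry Potter I"}, {"Cookbook"}, {"Harry Potter II"}, {"Harry Potter III"}],
    [{"Harry Potter II"}, {"Harry Potter I"}],
    [{"Cookbook"}],
]
pattern = [{"Harry Potter I"}, {"Harry Potter II"}, {"Harry Potter III"}]
min_support = 0.05  # user-specified minimum support
pattern_support = sequence_support(database, pattern)
is_frequent = pattern_support >= min_support
print(pattern_support, is_frequent)
```

This only checks one candidate pattern; a full miner would also enumerate the candidate patterns, which is outside the scope of this sketch.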

Page 9: Web-Based Data Mining System

Finding Classification Rules
  • Finding discriminant rules for objects of different classes.
  • Approaches:
      • Finding Decision Trees
      • Finding Production Rules
  • Applications:
      • Processing loan and credit card applications
      • Model identification

Page 10: Web-Based Data Mining System

  • Text Mining
  • Web Usage Mining
  • Etc.

Page 11: Web-Based Data Mining System

Related Technologies
  • Database Systems
      • MS SQL Server
          • Transaction databases
          • OLAP (Data Cubes)
          • Data Mining: Decision Trees, Clustering Tools
  • Machine Learning/Data Mining Systems
      • CART (Classification And Regression Trees)
      • C 5.x (Decision Trees)
      • WEKA (Waikato Environment for Knowledge Analysis)
      • LERS
      • ROSE 2
  • Rule-Based Expert System Development Environments
      • CLIPS, JESS
      • EXSYS
  • Web-based Platforms
      • Java
      • MS .NET

Page 12: Web-Based Data Mining System

Comparisons

System        | Pre-Processing | Learning / Data Mining                | Inference Engine | End-User Interface | Web-Based Access  | Reasoning with Uncertainties
--------------|----------------|---------------------------------------|------------------|--------------------|-------------------|-----------------------------
MS SQL Server | N/A            | Decision Trees, Clustering            | N/A              | N/A                | N/A               | N/A
CART, C 5.x   | N/A            | Decision Trees                        | Built-in         | Embedded           | N/A               | N/A
WEKA          | Yes            | Trees, Rules, Clustering, Association | N/A              | Embedded           | Needs programming | N/A
CLIPS, JESS   | N/A            | N/A                                   | Built-in         | Embedded           | Needs programming | Third-party extensions

Page 13: Web-Based Data Mining System

Rule-Based Data Mining System

Objectives: develop an integrated rule-based data mining system that provides
  • Synergy of database systems, machine learning, and expert systems
  • Dealing with uncertain rules
  • Delivery of web-based user interface

Page 14: Web-Based Data Mining System

Structure of Rule-Based Systems

[Diagram components: Rule Base, Working Memory, Matcher, Selector, Execution, Answer (Yes/No), Inference Result]
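The labels on this slide (Rule Base, Working Memory, Matcher, Selector, Execution) correspond to the usual match-select-execute cycle of a rule-based system. The sketch below is one minimal way to express that cycle in Python. It assumes rules are simple condition/conclusion pairs over a set of facts and uses a first-match selection strategy; the rule contents and all names are illustrative assumptions, not the system presented in the talk.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Rule:
    conditions: FrozenSet[str]  # facts that must all be in working memory
    conclusion: str             # fact added when the rule fires


def match(rule_base: List[Rule], memory: Set[str]) -> List[Rule]:
    """Matcher: rules whose conditions hold and whose conclusion is still new."""
    return [r for r in rule_base
            if r.conditions <= memory and r.conclusion not in memory]


def select(candidates: List[Rule]) -> Rule:
    """Selector: simplest possible conflict resolution, take the first match."""
    return candidates[0]


def infer(rule_base: List[Rule], facts: Set[str], goal: str) -> Optional[Set[str]]:
    """Repeat the cycle until the goal is derived or no rule matches."""
    memory = set(facts)  # working memory starts with the known facts
    while goal not in memory:
        candidates = match(rule_base, memory)
        if not candidates:
            return None  # no answer can be derived
        memory.add(select(candidates).conclusion)  # Execution step
    return memory


# Hypothetical rule base for illustration.
rule_base = [
    Rule(frozenset({"income=high", "debt=low"}), "risk=low"),
    Rule(frozenset({"risk=low"}), "loan=approve"),
]
result = infer(rule_base, {"income=high", "debt=low"}, "loan=approve")
print(sorted(result) if result else "no answer")
```

Real engines such as CLIPS and JESS use far more efficient matching and richer conflict-resolution strategies; the loop above only shows the control flow.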

Page 15: Web-Based Data Mining System

System Workflow

[Diagram: Input Data Set → Data Pre-processing → Rule Generator → User Interface Generator]

Page 16: Web-Based Data Mining System

Input Data Set:
  • Text file with comma separated values (CSV)
  • It is assumed that there are N columns of values corresponding to N variables or parameters, which may be real or symbolic values.
  • The first N - 1 variables are considered as inputs and the last one is the output variable.

Data Preprocessing:
  • Discretize domains of real variables into a finite number of intervals
  • Discretized data file is then used to generate an attribute information file and a training data file.

Rule Generator:
  • A symbolic learning program called BLEM2 is used to generate rules with uncertainty

User Interface Generator:
  • Generate a web-based rule-based system from a rule file and corresponding attribute file
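The input convention and the discretization step described on this slide can be sketched as follows. The slide does not say which discretization method the system uses, so equal-width binning is an assumption chosen for simplicity; the sample data, the bin count, and the function names `load_dataset` and `equal_width_bins` are likewise hypothetical.

```python
import csv
import io
from typing import List, Tuple


def load_dataset(text: str) -> Tuple[List[List[str]], List[str]]:
    """Split each CSV row into N - 1 input values and the last (output) value."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    inputs = [row[:-1] for row in rows]
    outputs = [row[-1] for row in rows]
    return inputs, outputs


def equal_width_bins(values: List[float], k: int) -> List[int]:
    """Map each real value to one of k equal-width intervals, numbered 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 for constant columns
    return [min(int((v - lo) / width), k - 1) for v in values]


# Illustrative data: one real input, one symbolic input, one output column.
sample = "5.1,red,yes\n6.9,blue,no\n5.8,red,yes\n7.7,blue,no\n"
inputs, outputs = load_dataset(sample)
lengths = [float(row[0]) for row in inputs]   # the real-valued column
bins = equal_width_bins(lengths, k=3)
discretized = [[str(b)] + row[1:] for b, row in zip(bins, inputs)]
print(discretized, outputs)
```

After this step every input is symbolic, which is the form a symbolic rule learner expects.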

Page 17: Web-Based Data Mining System

Architecture of RBC generator

[Diagram components: Client, Requests, Responses, Middle Tier, SQL DB server]

Workflow of RBC generator

[Diagram components: Rule set File, Metadata File, Rule Table Definition, RBC Generator]

Page 18: Web-Based Data Mining System

Concluding Remarks

A system for generating rule-based classifiers from data, with the following benefits:
  • No need for end-user programming
  • Automatic rule-based system creation
  • Web-based delivery provides easy access

Page 19: Web-Based Data Mining System

Project Status

The current version (1.4) of our system provides fundamental data mining features, including:
  • Data Preprocessing
  • Management of preprocessed data files
  • Machine Learning tool to generate rules from data
  • Rule-Based Classifier system supporting uncertain rules
  • Web-Based access

Page 20: Web-Based Data Mining System

Future Work
  • More advanced features in Data Preprocessing such as data cleansing, data transformation, and data statistics
  • Learning from multi-criteria inputs with preferential rankings to support Multiple Criteria Decision Making processes
  • Concept-Oriented information retrieval and search

Page 21: Web-Based Data Mining System

Thank You!
