These slides are additional material for TIES445

Lecture 5

TIES445 Data mining

Nov-Dec 2007

Sami Äyrämö

A data mining algorithm

”A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models and patterns”

– ”Well-defined” indicates that the procedure can be precisely encoded as a finite set of rules

– ”Algorithm”: a procedure that always terminates after some finite number of steps and produces an output

– ”Computational method”: has all the properties of an algorithm except a guarantee that the procedure will terminate in a finite number of steps (a computational method is usually described more abstractly than an algorithm, e.g., steepest descent is a computational method)
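
A minimal sketch of the distinction between a computational method and an algorithm, under stated assumptions: plain steepest descent carries no termination guarantee, but bounding the number of iterations makes it terminate in a finite number of steps. The quadratic score function and the names `grad`, `step`, `tol` and `max_iter` are illustrative choices, not taken from the slides.

```python
def steepest_descent(grad, x0, step=0.1, tol=1e-8, max_iter=1000):
    """Steepest descent turned into an algorithm by bounding the iterations.

    All parameter names are hypothetical and chosen for illustration only.
    """
    x = list(x0)
    for iteration in range(max_iter):           # finite bound: always terminates
        g = grad(x)
        grad_norm = sum(gi * gi for gi in g) ** 0.5
        if grad_norm < tol:
            return x, iteration                 # converged before reaching the cap
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, max_iter                          # cap reached: an output is still produced


# Illustrative score function f(x, y) = (x - 3)^2 + (y + 1)^2 and its gradient.
def grad_f(v):
    return [2.0 * (v[0] - 3.0), 2.0 * (v[1] + 1.0)]


minimum, steps = steepest_descent(grad_f, [0.0, 0.0])
```

Without the `max_iter` cap the loop would only stop if the gradient norm happened to fall below `tol`, which is why the uncapped procedure counts as a computational method rather than an algorithm.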

Data mining tasks

• Explorative (visualization)

• Descriptive (clustering, rule finding,…)

• Predictive (classification, regression,…)

Elements of data mining algorithms

• Data mining task

• Structure of the model or pattern

• Score function

• Search/optimization method

• Data management technique

Structure

Structure (functional form) of the model or pattern that will be fitted to the data

Defines the boundaries of what can be approximated or learned

Within these boundaries, the data guide us to a particular model or pattern

E.g., hierarchical clustering model, linear regression model, mixture model
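
A small sketch of how the structure bounds what can be learned, assuming a linear regression structure y = a + b·x fitted by ordinary least squares. The toy data and the helper names `fit_line` and `squared_error` are illustrative, not from the slides.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of the structure y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def squared_error(xs, ys, a, b):
    """Sum of squared residuals of the fitted line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_ys = [2.0 * x + 1.0 for x in xs]      # lies inside the linear structure
quadratic_ys = [x * x for x in xs]           # lies outside the linear structure

a_lin, b_lin = fit_line(xs, linear_ys)
a_quad, b_quad = fit_line(xs, quadratic_ys)
err_lin = squared_error(xs, linear_ys, a_lin, b_lin)
err_quad = squared_error(xs, quadratic_ys, a_quad, b_quad)
```

In both cases the data select the parameters a and b, but only the first data set can be represented exactly. For the quadratic data the best line is flat and a residual error remains, whatever the fitting procedure.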

Structure: decision tree

Figure from the book ”Tan, Steinbach, Kumar, Introduction to Data Mining, Addison Wesley, 2006.”

Structure: MLP

Figures by Tommi Kärkkäinen

Score function

• Judges the quality of the fitted models or patterns based on observed data

• Minimized/maximized when fitting parameters to our models and patterns

• Critical for learning and generalization

– Goodness-of-fit vs. generalization, e.g., the number of neurons in a neural network

• E.g., misclassification error, squared error, support/accuracy
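
The example score functions named above can be sketched in a few lines. This is a hedged illustration: the function names and toy data are hypothetical, and "accuracy" of a rule is taken here to mean the fraction of transactions matching the antecedent that also contain the consequent, which is an assumption about the slide's terminology.

```python
def misclassification_error(y_true, y_pred):
    """Fraction of predictions that differ from the true class labels."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def squared_error(y_true, y_pred):
    """Sum of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def support_and_accuracy(transactions, antecedent, consequent):
    """Support and accuracy of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / len(transactions)
    accuracy = n_both / n_antecedent if n_antecedent else 0.0
    return support, accuracy


# Toy data, for illustration only.
class_error = misclassification_error([1, 0, 1, 1], [1, 1, 1, 0])
reg_error = squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
support, accuracy = support_and_accuracy(baskets, {"bread"}, {"milk"})
```

The first two scores are minimized when fitting a predictive model, whereas support and accuracy are typically thresholded or maximized when searching for rule patterns.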

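Score functions: Prototype-based clustering

$$\min_{I,\,\{c_k\}_{k=1}^{K}} J_q^{\alpha}\big(I, \{c_k\}_{k=1}^{K}, \{x_i\}_{i=1}^{n}\big),$$

$$\text{for } J_q^{\alpha}\big(I, \{c_k\}_{k=1}^{K}, \{x_i\}_{i=1}^{n}\big) = \sum_{i=1}^{n} \big\| P_i \big(x_i - c_{(I)_i}\big) \big\|_q^{\alpha},$$

$$\text{where } (I)_i \in \{1,\dots,K\}, \quad c_k \in \mathbb{R}^p, \quad \text{and } P_i = \operatorname{diag}(p_i).$$

α = 2, q=2 → K-means

α = 1, q=2 → K-spatialmedians

α = 1, q=1 → K-coord.medians

• Different statistical properties of the cluster models

• Different algorithms and computational methods for solving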
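
A minimal sketch of how the three (α, q) cases differ numerically. It assumes the score is the sum, over all data points, of the q-norm distance to the assigned prototype raised to the power α, and it omits any projection for missing values. The function name and the toy data are hypothetical.

```python
def clustering_score(points, prototypes, assignment, alpha=2, q=2):
    """Sum over points of ||x_i - c_k||_q ** alpha, where k is the assigned cluster."""
    total = 0.0
    for x, k in zip(points, assignment):
        c = prototypes[k]
        norm_q = sum(abs(xj - cj) ** q for xj, cj in zip(x, c)) ** (1.0 / q)
        total += norm_q ** alpha
    return total


# Two points assigned to a single prototype at the origin (toy data).
points = [(0.0, 0.0), (3.0, 4.0)]
prototypes = [(0.0, 0.0)]
assignment = [0, 0]

kmeans_score = clustering_score(points, prototypes, assignment, alpha=2, q=2)
spatial_median_score = clustering_score(points, prototypes, assignment, alpha=1, q=2)
coord_median_score = clustering_score(points, prototypes, assignment, alpha=1, q=1)
```

The squared Euclidean case penalizes the distant point far more heavily than the two α = 1 cases, which is one way to see why the resulting cluster models have different statistical properties.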

Score function: Overfitting vs. generalization

Figures by Tommi Kärkkäinen

Search/optimization method

• Used to search over parameters and structures

• Computational procedures and algorithms used to find the maximum/minimum of the score function for particular models or patterns

– Includes computational methods used to optimize the score function, e.g., steepest descent

– Includes search-related parameters, e.g., the maximum number of iterations or convergence specification for an iterative algorithm

• Single fixed structure (e.g., kth order polynomial function of the inputs) or family of different structures (i.e., search over both structures and their associated parameter spaces)

Search/optimization: K-means-like clustering

1. Initialize the cluster prototypes

2. Assign each data point to the closest cluster prototype

3. Compute the new estimates (may require another iterative algorithm) for the cluster prototypes

4. Termination: stop if termination criteria are satisfied (usually no changes in I)
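
The four steps above can be sketched as follows for the K-means case, where the new prototype estimate is simply the cluster mean. This is an illustrative sketch: initial prototypes are passed in by the caller, and the function names and the `max_iter` safeguard are assumptions rather than part of the slides.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, initial_prototypes, max_iter=100):
    # Step 1: initialize the cluster prototypes (supplied by the caller here).
    prototypes = [tuple(p) for p in initial_prototypes]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each data point to the closest cluster prototype.
        new_assignment = [
            min(range(len(prototypes)), key=lambda k: sq_dist(x, prototypes[k]))
            for x in points
        ]
        # Step 4: stop when the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each prototype as the mean of its members.
        for k in range(len(prototypes)):
            members = [x for x, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous prototype
                prototypes[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, prototypes


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, prototypes = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

For the median-based variants, step 3 would replace the mean with a coordinatewise or spatial median. The spatial median has no closed form, which is where the "another iterative algorithm" remark in step 3 applies.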

Data management technique

• Storing, indexing, and retrieving data

• Not usually specified by statistical or machine learning algorithms

– A common assumption is that the data set is small enough to reside in the main memory so that random access of any data point is free relative to actual computational costs

• Massive data sets may exceed the capacity of available main memory

– The physical location of the data and the manner in which it is accessed can be critically important in terms of algorithm efficiency
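
When the data do not fit in main memory, one common option is to process them in a single sequential pass instead of relying on random access. The sketch below assumes one numeric value per text line and uses Welford's running update for the mean and variance. The in-memory `io.StringIO` object merely stands in for a large file on disk, and the function name is hypothetical.

```python
import io


def streaming_mean_variance(lines):
    """One sequential pass over text lines, keeping only three numbers in memory."""
    count = 0
    mean = 0.0
    m2 = 0.0                        # running sum of squared deviations from the mean
    for line in lines:
        value = float(line)
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    variance = m2 / count if count else 0.0   # population variance
    return count, mean, variance


# Stand-in for a file too large to load at once (toy data).
data_file = io.StringIO("2\n4\n4\n4\n5\n5\n7\n9\n")
count, mean, variance = streaming_mean_variance(data_file)
```

The memory footprint is constant regardless of how many lines are read, at the price of accessing the data strictly in storage order.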

Data management technique: memory

A general categorization of different memory structures

1. Registers of processors: direct access, no slowdown

2. On-processor or on-board cache: fast semiconductor memory on the same chip as the processor

3. Main memory: Normal semiconductor memory (up to several gigabytes)

4. Disk cache: intermediate storage between main memory and disks

5. Disk memory: Terabytes. Access time milliseconds.

6. Magnetic tape: Access time even minutes.

Data management: index structures

• B-trees

• Hash indices

• Kd-trees

• Multidimensional indexing

• Relational datatables
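
A rough sketch of what two of these index types buy, using only in-memory Python structures as stand-ins: a dictionary plays the role of a hash index (equality lookups), and a sorted key list searched with `bisect` plays the role of an ordered index such as a B-tree (range lookups). The records and function names are made up for illustration.

```python
import bisect
from collections import defaultdict

records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 51},
    {"id": 3, "age": 34},
    {"id": 4, "age": 27},
]

# Hash index: attribute value -> positions of matching records.
hash_index = defaultdict(list)
for pos, rec in enumerate(records):
    hash_index[rec["age"]].append(pos)


def lookup_equal(age):
    """Equality query answered without scanning all records."""
    return [records[pos]["id"] for pos in hash_index.get(age, [])]


# Ordered index: (key, position) pairs sorted by key, as a B-tree stand-in.
sorted_index = sorted((rec["age"], pos) for pos, rec in enumerate(records))
sorted_keys = [age for age, _ in sorted_index]


def lookup_range(low, high):
    """Range query low <= age <= high answered by two binary searches."""
    start = bisect.bisect_left(sorted_keys, low)
    stop = bisect.bisect_right(sorted_keys, high)
    return [records[pos]["id"] for _, pos in sorted_index[start:stop]]
```

For example, `lookup_equal(34)` touches only the two matching records, and `lookup_range(30, 60)` locates its result with binary search rather than a full scan. A hash index cannot answer the range query efficiently, which is one reason ordered and multidimensional indices exist alongside it.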

Examples

                           CART                            Backpropagation                        APriori

Task                       Classification and regression   Regression                             Rule pattern discovery

Structure                  Decision tree                   Neural network (non-linear function)   Association rules

Score function             Cross-validated loss function   Squared error                          Support/accuracy

Search method              Greedy search over structures   Gradient descent on parameters         Breadth-first search

Data management technique  Unspecified                     Unspecified                            Linear scans
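
To make the APriori entries concrete, here is a simplified sketch of level-wise frequent itemset search: candidates are explored breadth-first (all itemsets of size k before size k+1), and each level is counted with one linear scan over the transactions. It covers only the itemset phase, not rule generation, and the function name, threshold and toy baskets are illustrative assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        # One linear scan of the data counts every candidate of the current size.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Breadth-first step: build size+1 candidates from the frequent sets,
        # keeping only those whose subsets of the current size are all frequent.
        size += 1
        unions = {a | b for a in level for b in level if len(a | b) == size}
        candidates = [
            c for c in unions
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent


baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
frequent = apriori(baskets, min_support=0.5)
```

The subset check prunes {bread, milk, butter} before it is ever counted, because {milk, butter} is not frequent. The number of passes over the data is therefore bounded by the size of the largest frequent itemset plus one.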