16
Data mining/machine learning I large data sets I high dimensional spaces I potentially little information on structure I computationally intensive I plots are essential, but require considerable pre-processing I emphasis on means and variances I emphasis on prediction I prediction rules are often automated: e.g. spam filters I Hand, Mannila, Smyth “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner” 1 / 16

Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

Embed Size (px)

Citation preview

Page 1: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

Data mining/machine learningI large data setsI high dimensional spacesI potentially little information on structureI computationally intensiveI plots are essential, but require considerable pre-processingI emphasis on means and variancesI emphasis on predictionI prediction rules are often automated: e.g. spam filters

I Hand, Mannila, Smyth “Data mining is the analysis of(often large) observational data sets to find unsuspectedrelationships and to summarize the data in novel ways thatare both understandable and useful to the data owner”

1 / 16

Page 2: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

Applied Statistics (Sta 302/437/442)I small data setsI low dimensionI lots of information on the structureI plots are useful, and easy to useI emphasis on likelihood using probability distributionsI emphasis on inference: often maximum likelihood

estimators (or least squares estimators)I inference results highly ‘user-intensive’: focus on scientific

understanding

2 / 16

Page 3: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

Some common methodsI regression (Sta 302)I classification (Sta 437 – applied multivariate)I kernel smoothing (Sta 414, App Stat 2)

I neural networksI Support Vector MachinesI kernel methodsI nearest neighboursI classification and regression treesI random forestsI ensemble learning

3 / 16

Page 4: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

More quotesI HMS1: “Data mining is often set in the broader context of

knowledge discovery in databases, or KDD”Example: Irvine web

I Data mining tasks: exploratory data analysis; descriptivemodelling (ca, seg), descriptive modelling (classification,regression), pattern detection, retrieval by content,recommender systems

I Bishop (pattern recognition): training/learning phase, targetvector, generalization, pre-processing/feature extraction

I Haughton et al., software for data mining: EnterpriseMiner, XLMiner, Quadstone, GhostMiner, Clementine

I

Enterprise Miner $40,000-$100,000XLMiner $1,499 (educ)Quadstone $200,000 - $1,000,000GhostMiner $2,500 – $30,000 + annual fees

1refs on last slide4 / 16

Page 5: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

Examples from textI email spam 4601 messages, frequencies of 57 commonly

occurring wordsI prostrate cancer 97 patients, 9 covariatesI wine data 178 wines, 3 cultivars, 13 covariatesI handwriting recognition dataI DNA microarray data 64 samples, 6830 genes

I “Learning” = building a prediction model to predict anoutcome for new observations

I “Supervised learning” = regression or classification: haveavailable sample of outcome (y )

I “Unsupervised learning” = clustering: searching forstructure in the data

go to figures1.pdf and ESLII.pdf

5 / 16

Page 6: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

Elements of Statistical Learning c©Hastie, Tibshirani & Friedman 2001 Chapter 1

lpsa

-1 1 3

oooooo ooo ooo ooooo oo o o oo oo ooo o ooo ooo ooo ooo oo ooo oo ooo oo o ooooooo o oooo ooo o ooo o ooooooo oooo ooo oo ooo

ooo o

o oo ooooooooooo oo ooo ooooooo oooo o oooo ooo ooo ooooo oooo ooo oo oo o oo ooo ooooooooo oo oo ooooo o oooooo oo ooo

oooo40 60 80

o o oooo ooo oooo oo o ooo oooo o oooooooo oo oo oo oo o ooo oo ooo oooo oo ooo oo oo oooo o oooo ooo oo o ooooo ooo oooo ooo oo o oo

oooooo o oooo ooooo ooo oo oo o ooo ooo ooooo ooo oooo oo oo ooo oo oooo oo oo o ooo oo o o oo oo oo oo o oooo ooo o oo ooo oo oooo oo

0.0 0.4 0.8

oooooooooooooooooooooooooooooooooooooo oooooooo ooooooooooooooo oo ooooooo oo oooooo oooo ooo oo oooo oo

oooo

oooooooooooo oo oo o oooo oo ooo oo o ooooo oo oo oooo o ooo ooo ooo o ooo ooooo oo oo o ooo o oo o o o ooo ooo oo oo oo o oooo o ooo o

6.0 7.5 9.0

oo oooooooooo oooo ooooo oo ooo ooooooooo o oooo ooo ooo oooo oo oooooo ooo o ooo oooo oooo oooooooooo oooo oooo oo

oooo

01

23

45

oo oooooooooo oooo ooooo oo ooo oo oooooooooo oo oooo oo oooo oo oooooo o oo o ooo o oooo ooo oo ooo ooo oo oo oo o o oo o ooo oo

-10

12

34

oooo

o

o

oo

ooo

o

oooo

o

o

oooo

o

o

oooo

o

o

oo

o

oo

ooo

o

oooo

oooo

ooooo

o

oo

oooooo

ooooooo

o

ooooo

oo

oooo

oooo

o

o

oo

o

o

ooo

oooo

lcavol

o oo

o

o

o

oo

ooo

o

oo oo

o

o

oo

oo

o

o

oo

oo

o

o

o o

o

oo

ooo

o

oooo

oooo

ooo o

o

o

oo

ooo oo

o

oo

ooooo

o

oo

o oo

oo

oooo

oo oo

o

o

oo

o

o

oooo

ooo

o oo

o

o

o

oo

ooo

o

o oo o

o

o

oo

oo

o

o

oo

oo

o

o

oo

o

o o

o oo

o

o oo

o

o ooo

oo

ooo

o

oo

ooo o

oo

oo

ooo

o o

o

oo

ooo

oo

oooo

oo oo

o

o

oo

o

o

oo o

oo o

o

oooo

o

o

o o

ooo

o

oooo

o

o

oo

oo

o

o

oo

oo

o

o

oo

o

oo

ooo

o

ooo

o

o ooo

oo

ooo

o

oo

ooooo

o

oo

ooo

o o

o

oo

ooo

o o

oooo

ooo o

o

o

oo

o

o

ooo

oo o

o

oooo

o

o

oo

ooo

o

oooo

o

o

oooo

o

o

oooo

o

o

oo

o

oo

ooo

o

oooo

oooo

ooooo

o

oo

oooooo

oo

ooooo

o

oo

o oo

oo

oo oo

oo oo

o

o

o o

o

o

ooooooo

oooo

o

o

oo

ooo

o

oo oo

o

o

ooo

o

o

o

oo

oo

o

o

oo

o

o o

o oo

o

ooo

o

oooo

oo

ooo

o

oo

o ooooo

oo

ooo

oo

o

oo

o oo

o o

oo oo

ooo o

o

o

o o

o

o

oo o

ooo

o

ooo

o

o

o

oo

ooo

o

oooo

o

o

ooo

o

o

o

oo

oo

o

o

oo

o

oo

o oo

o

o oo

o

oooo

ooo oo

o

oo

ooo o

oo

oo

ooooo

o

oooo

o

oo

oooo

ooo o

o

o

o o

o

o

ooooooo

ooo

o

o

o

oo

ooo

o

oooo

o

o

ooo

o

o

o

oo

oo

o

o

oo

o

oo

ooo

o

o oo

o

o ooo

ooo oo

o

oo

ooo o

oo

oo

ooo

oo

o

oo

ooo

o o

oo oo

ooo o

o

o

o o

o

o

oo o

oo o

o

oooooooooo

ooooooooooooooooooooo

o

ooo

oo

o

o

ooooooooooooooooo

o

oooo

ooooooooo

oooooo

oooooooooooo

o

oooooo

oo

oo

oo oo ooo o

oooo

oo

o oo

oo oo ooo

ooo o

o

o

ooo

oo

o

o

oooo ooo

oooo

o oo

oo

o

o

oo

oo

o oooo oo

oo

ooo

ooo

oooo

ooooo oo

o

o

oo

oo ooo o lweight

oo

oooo ooo o

ooo o

oo

ooo

oooo o o

ooooo

o

o

oo o

oo

o

o

o oooo

oooo

o ooo

oo

oo

o

oo

oo

o oooo o ooo

o oo

ooo

o ooo

oo

ooo ooo

o

oo

o ooo

oo

oooooo o oooo ooooo

ooo

oo oo o o

oo o

ooo

o

ooo

oo

o

o

ooo oo

oooo

o oo o

oo

oo

o

oo

oo

oooo o o o

oo

o oo

oo

o

oooo

oo

o o oo oo

o

oo

oooo

oo

ooooooooooooooooooooooooooooooo

o

ooo

oo

o

o

ooooooo

oooooooooo

o

oooo

oooooo

ooo

oooooo

oooo

oo

ooo ooo

o

oo

oooooo

oooooooooooo

ooo

oo o

ooo oo ooo

ooo o

o

o

oo o

oo

o

o

ooo o o

oooo

o ooo

oo

oo

o

oooo

oooo o oo

oo

ooo

oo

o

oooo

oo

o oo ooo

o

ooo o o

oo o

oo

oooooooooo

ooo

ooo

ooo oo ooo

oooo

o

o

ooo

oo

o

o

o ooo ooo

oooo

oooooo

o

ooo

o

o ooo ooo

oo

ooo

ooo

ooooooo ooooo

o

oo

oooooo

34

56

oo

oooooooooo

ooo

ooo

ooo oo ooo

oooo

o

o

ooo

oo

o

o

o oooo

oooo

oooo

oo

oo

o

oo

oo

o ooo o oo

oo

ooo

oo

o

oooo

oo

o oo ooo

o

oo

o ooo

oo

4050

6070

80

o

o

o

oo

o

oo

o

ooooo

o

ooo

o

o

ooooooooooooo

o

oo

o

oo

oo

ooooo

o

o

o

ooooo

oo

oo

o

o

o

o

oooooooo

o

o

o

o

oooo

ooo

o

o

oooooo

o

ooo

oo

oo

o

o

o

oo

o

oo

o

ooo oo

o

oo

o

o

o

o ooo

o ooo o oooo

o

o o

o

oo

oo

oo o

oo

o

o

o

oooo

o

oo

oo

o

o

o

o

ooooooo o

o

o

o

o

ooo

o

ooo

o

o

oo

oo

o o

o

ooo

oo

o o

o

o

o

oo

o

oo

o

ooooo

o

ooo

o

o

oooooo oooo o o

o

o

o o

o

oo

oo

ooo

oo

o

o

o

o oooo

oo

o o

o

o

o

o

oooooooo

o

o

o

o

oooo

ooo

o

o

oo

oo

o o

o

ooo

oo

oo

ageo

o

o

oo

o

oo

o

oo ooo

o

oo

o

o

o

o ooo

ooo ooo ooo

o

o o

o

oo

oo

oooo

o

o

o

o

oooo

o

oo

o o

o

o

o

o

ooo o

ooo o

o

o

o

o

o ooo

oo o

o

o

oo

oo

oo

o

oo

o

oo

oo

o

o

o

oo

o

oo

o

ooooo

o

ooo

o

o

ooooooooooooo

o

oo

o

oo

oo

ooooo

o

o

o

ooooo

oo

oo

o

o

o

o

oooo

oooo

o

o

o

o

oooo

ooo

o

o

oo

oo

oo

o

ooo

oo

oo

o

o

o

oo

o

oo

o

ooo oo

o

oo

o

o

o

o ooo

oo oo o oooo

o

oo

o

oo

oo

oo oo

o

o

o

o

ooo o

o

oo

oo

o

o

o

o

ooo o

ooo o

o

o

o

o

o oo

o

ooo

o

o

oo

oo

o o

o

oo

o

oo

o o

o

o

o

oo

o

oo

o

ooo oo

o

oo

o

o

o

o ooo

oo ooooooo

o

o o

o

oo

oo

oo o

oo

o

o

o

o oo o

o

oo

oo

o

o

o

o

ooo oooo o

o

o

o

o

oooo

ooo

o

o

ooo

ooo

o

ooo

oo

oo

o

o

o

oo

o

oo

o

ooo oo

o

oo

o

o

o

o ooo

oo oo ooooo

o

oo

o

oo

oo

ooo

oo

o

o

o

o oo o

o

oo

oo

o

o

o

o

ooo o

oooo

o

o

o

o

o oo

o

ooo

o

o

oo

oo

o o

o

oo

o

oo

oo

oooooo

o

o

ooo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

oo

o

oo

o

o

oo

o

o

o

ooo

o

o

oooo oo

o

o

o oo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

o o

o

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

o o

o

oo

o

o

o o

o

o

o

o oo

o

o

o oo ooo

o

o

ooo

o

oo oo

o

oo

o

o

o

o

o

o

o

o

o

o

o

o oo

oo

o

o

o

oo

o o

o

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

o o

o

oo

o

o

o o

o

o

o

ooo

o

o

o o oooo

o

o

o oo

o

o oo o

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

o o

o

o

o

oo

o o

o

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o

o

o

oo

o

o

o

o

o

o

o

o

o

o o

oo

oo

o

oo

o

o

oo

o

o

o

oo o

o

o lbph

oooooo

o

o

ooo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

oo

o

oo

o

o

oo

o

o

o

ooo

o

o

oooooo

o

o

ooo

o

oo oo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

o o

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

oo

o o

o

oo

o

o

o o

o

o

o

o oo

o

o

oo oooo

o

o

ooo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

oo

o

oo

o

o

oo

o

o

o

ooo

o

o

-10

12

oo oooo

o

o

ooo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

oo

o o

o

oo

o

o

o o

o

o

o

ooo

o

o

0.0

0.4

0.8

oooooooooooooooooooooooooooooooooooooo

o

ooooooo

o

oooooooooooooo

o

o

o

oooooo

o

o

oooo

oo

o

ooo

o

oo

o

o

ooo

o

oooooo

oooo oo ooo ooo ooooo oo o o oo oo ooo o ooo ooo ooo

o

oo oo ooo

o

o ooo oo o ooooooo

o

o

o

oo ooo o

o

o

o o oo

oo

o

oo o

o

oo

o

o

o oo

o

oo ooo o

o oo ooooooooooo oo ooo ooooooo oooo o oooo ooo

o

oo ooooo

o

ooo ooo oo oo o oo o

o

o

o

oooooo

o

o

oo oo

oo

o

oo o

o

oo

o

o

o oo

o

oooooo

o o oooo ooo oooo oo o ooo oooo o oooooooo oo oo oo

o

o o ooo oo

o

oo oooo oo ooo oo o

o

o

o

oo o ooo

o

o

oo oo

o o

o

ooo

o

oo

o

o

oo o

o

o oo o oo

oooooo o oooo ooooo ooo oo oo o ooo ooo ooooo ooo

o

ooo oo oo

o

oo oo oooo oo oo o o

o

o

o

o o o oo o

o

o

o oo o

oo

o

o oo

o

o o

o

o

oo o

o

oooo oo

svi

oooooooooooo oo oo o oooo oo ooo oo o ooooo oo oo

o

ooo o ooo

o

oo ooo o ooo ooooo

o

o

o

o o ooo o

o

o

o o o o

oo

o

oo o

o

oo

o

o

o oo

o

o o ooo o

oo oooooooooo oooo ooooo oo ooo ooooooooo o oo

o

o ooo ooo

o

ooo oo oooooo ooo

o

o

o

o oooo o

o

o

o ooo

oo

o

ooo

o

oo

o

o

ooo

o

oooooo

oo oooooooooo oooo ooooo oo ooo oo oooooooooo

o

o oooo oo

o

ooo oo oooooo o oo

o

o

o

o o oooo

o

o

o oo o

oo

o

oo o

o

oo

o

o

o o o

o

o ooo oo

oooooooooooooo

o

oo

o

ooo

o

o

o

oooo

o

o

oooooo

o

o

o

o

oo

o

o

o

o

o

o

oooo

o

o

oo

o

oooo

ooo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

ooo

o

o

oooo oo ooo ooooo

o

oo

o

o o o

o

o

o

o oo

o

o

o

oo ooo

o

o

o

o

o

o o

o

o

o

o

o

o

ooo o

o

o

oo

o

oooo

o oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

ooo

o

o

o oo ooooooooooo

o

oo

o

o oo

o

o

o

oooo

o

o

oooo

oo

o

o

o

o

o o

o

o

o

o

o

o

oo

oo

o

o

o o

o

o oo o

ooo

o

o

o

o

oo

o

o

o

oo

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

ooo

o

o

o o oooo ooo oooo

o

o

oo

o

o oo

o

o

o

oooo

o

o

oo oo

oo

o

o

o

o

o o

o

o

o

o

o

o

oo

oo

o

o

o o

o

o oo o

o oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

oo o

o

o

oooooo o oooo ooo

o

oo

o

o oo

o

o

o

ooo

o

o

o

oooo

oo

o

o

o

o

oo

o

o

o

o

o

o

oo

o o

o

o

o o

o

oo o o

ooo

o

o

o

o

oo

o

o

o

oo

o

o

o

o

o o

o

o

o

o

o

o

o

o

o

o o

ooo

o

o

oooooooooooooo

o

oo

o

ooo

o

o

o

oooo

o

o

oooooo

o

o

o

o

oo

o

o

o

o

o

o

oooo

o

o

oo

o

oooo

ooo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

o o

ooo

o

o

lcp

oo oooooooooooo

o

oo

o

ooo

o

o

o

oooo

o

o

ooooo

o

o

o

o

o

oo

o

o

o

o

o

o

oo

oo

o

o

oo

o

o ooo

o oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

o o

ooo

o

o

-10

12

3

oo ooooooooooo

o

o

oo

o

ooo

o

o

o

ooo

o

o

o

oooooo

o

o

o

o

oo

o

o

o

o

o

o

oo

oo

o

o

oo

o

o o oo

o oo

o

o

o

o

oo

o

o

o

oo

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o o

ooo

o

o

6.0

7.0

8.0

9.0

oo

o

ooooooooo

ooo

o

o

oooo

o

o

o

oo

ooo

oooooo

o

o

ooo

o

o

o

ooo

o

o

oo

o

o

ooooo

o

oo

o

o

o

o

o

ooo

o

oooo

o

ooooooooo

o

oo

o

ooo

o

oooooo

oo

o

o oo ooo ooo

ooo

o

o

oo o o

o

o

o

o o

oo o

ooo ooo

o

o

o oo

o

o

o

ooo

o

o

oo

o

o

o o ooo

o

oo

o

o

o

o

o

o oo

o

o ooo

o

ooooooo oo

o

o o

o

o oo

o

oo ooo o

o o

o

ooooooooo

oo o

o

o

oo oo

o

o

o

oo

ooo

o o oooo

o

o

o oo

o

o

o

ooo

o

o

oo

o

o

o oo oo

o

oo

o

o

o

o

o

ooo

o

ooo o

o

oo ooooo o o

o

oo

o

o oo

o

oooooo

o o

o

ooo ooo ooo

o oo

o

o

oo oo

o

o

o

oo

ooo

ooo oo o

o

o

o oo

o

o

o

o oo

o

o

o o

o

o

o oo oo

o

oo

o

o

o

o

o

o o o

o

oo oo

o

oo o ooooo o

o

o o

o

oo o

o

o oo o oo

oo

o

ooo o oooo o

ooo

o

o

oo oo

o

o

o

oo

o oo

o ooooo

o

o

o oo

o

o

o

o oo

o

o

o o

o

o

ooo oo

o

o o

o

o

o

o

o

o o o

o

oo oo

o

o o oooo ooo

o

oo

o

oo o

o

oooo oo

oo

o

ooooooooo

ooo

o

o

oooo

o

o

o

oo

ooo

oooooo

o

o

o oo

o

o

o

ooo

o

o

oo

o

o

ooooo

o

oo

o

o

o

o

o

ooo

o

o oo o

o

oooo oooo o

o

o o

o

ooo

o

oooooo

oo

o

ooooooooo

oo o

o

o

oooo

o

o

o

oo

oo o

ooooo o

o

o

o oo

o

o

o

ooo

o

o

o o

o

o

o ooo o

o

oo

o

o

o

o

o

o oo

o

o oo o

o

o ooo ooo oo

o

o o

o

o oo

o

o o ooo ogleason

oo

o

ooooooooo

ooo

o

o

oooo

o

o

o

oo

oo o

oooooo

o

o

o oo

o

o

o

o oo

o

o

oo

o

o

ooooo

o

o o

o

o

o

o

o

o oo

o

o ooo

o

o ooo ooo oo

o

o o

o

o o o

o

o ooo oo

0 2 4oooooooooooo

o

ooo

o

oooooo

o

oo

o

o

o

oooooooooo

o

o

ooooo

o

o

oo

o

o

o

o

oooooo

o

o

o

o

oo

o

ooo

o

oo

o

o

oo

o

oooo

o

o

o

o

o

o

oo

oo

oo

o

o

oooo

o oo ooo ooo

o

ooo

o

oo o oo

o

o

o o

o

o

o

ooo ooo ooo

o

o

o

oo o

oo

o

o

oo

o

o

o

o

oo

ooo

o

o

o

o

o

oo

o

oo o

o

oo

o

o

oo

o

oo

oo

o

o

o

o

o

o

oo

oo

oo

o

o

o

3 4 5 6o oo

ooooooooo

o

o oo

o

oo oooo

o

oo

o

o

o

o o oooo ooo

o

o

o

ooo

oo

o

o

oo

o

o

o

o

oo

o oo

o

o

o

o

o

oo

o

ooo

o

o o

o

o

oo

o

oo

oo

o

o

o

o

o

o

oo

oo

oo

o

o

oo o

oooo ooo ooo

o

oo o

o

oo oooo

o

oo

o

o

o

ooo oo oooo

o

o

o

ooo

oo

o

o

o o

o

o

o

o

oo

ooo

o

o

o

o

o

oo

o

ooo

o

oo

o

o

oo

o

ooo

o

o

o

o

o

o

o

oo

oo

oo

o

o

o

-1 0 1 2oooooo o oooo o

o

ooo

o

oo ooo

o

o

oo

o

o

o

o ooooo ooo

o

o

o

ooooo

o

o

o o

o

o

o

o

oo

o oo

o

o

o

o

o

oo

o

oo o

o

oo

o

o

oo

o

oo

oo

o

o

o

o

o

o

oo

oo

oo

o

o

ooooooooooooo

o

ooo

o

oooooo

o

oo

o

o

o

ooooooooo

o

o

o

ooooo

o

o

oo

o

o

o

o

oooooo

o

o

o

o

oo

o

ooo

o

o o

o

o

oo

o

oooo

o

o

o

o

o

o

oo

oo

oo

o

o

o

-1 1 2 3oooooooooooo

o

o oo

o

ooooo

o

o

oo

o

o

o

ooooo ooooo

o

o

oo ooo

o

o

o o

o

o

o

o

oo

oooo

o

o

o

o

oo

o

oo o

o

o o

o

o

oo

o

ooo

o

o

o

o

o

o

o

oo

oo

oo

o

o

ooo

oooooooooo

o

ooo

o

ooooo

o

o

oo

o

o

o

oooooo ooo

o

o

o

oo o

oo

o

o

oo

o

o

o

o

oooooo

o

o

o

o

oo

o

oo o

o

oo

o

o

oo

o

oooo

o

o

o

o

o

o

oo

oo

oo

o

o

o

0 40 800

2060

100

pgg45

Figure 1.1: Scatterplot matrix of the prostate cancer

data. The first row shows the response against each of

the predictors in turn. Two of the predictors, svi and

gleason, are categorical.

6 / 16

Page 7: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

Some of the wine data:1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,10651,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,10501,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,11851,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,14801,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,7351,14.2,1.76,2.45,15.2,112,3.27,3.39,.34,1.97,6.75,1.05,2.85,14501,14.39,1.87,2.45,14.6,96,2.5,2.52,.3,1.98,5.25,1.02,3.58,12901,14.06,2.15,2.61,17.6,121,2.6,2.51,.31,1.25,5.05,1.06,3.58,12951,14.83,1.64,2.17,14,97,2.8,2.98,.29,1.98,5.2,1.08,2.85,10451,13.86,1.35,2.27,16,98,2.98,3.15,.22,1.85,7.22,1.01,3.55,10451,14.1,2.16,2.3,18,105,2.95,3.32,.22,2.38,5.75,1.25,3.17,15101,14.12,1.48,2.32,16.8,95,2.2,2.43,.26,1.57,5,1.17,2.82,12801,13.75,1.73,2.41,16,89,2.6,2.76,.29,1.81,5.6,1.15,2.9,13201,14.75,1.73,2.39,11.4,91,3.1,3.69,.43,2.81,5.4,1.25,2.73,11501,14.38,1.87,2.38,12,102,3.3,3.64,.29,2.96,7.5,1.2,3,15471,13.63,1.81,2.7,17.2,112,2.85,2.91,.3,1.46,7.3,1.28,2.88,13101,14.3,1.92,2.72,20,120,2.8,3.14,.33,1.97,6.2,1.07,2.65,12801,13.83,1.57,2.62,20,115,2.95,3.4,.4,1.72,6.6,1.13,2.57,11301,14.19,1.59,2.48,16.5,108,3.3,3.93,.32,1.86,8.7,1.23,2.82,16801,13.64,3.1,2.56,15.2,116,2.7,3.03,.17,1.66,5.1,.96,3.36,8451,14.06,1.63,2.28,16,126,3,3.17,.24,2.1,5.65,1.09,3.71,7801,12.93,3.8,2.65,18.6,102,2.41,2.41,.25,1.98,4.5,1.03,3.52,7701,13.71,1.86,2.36,16.6,101,2.61,2.88,.27,1.69,3.8,1.11,4,10351,12.85,1.6,2.52,17.8,95,2.48,2.37,.26,1.46,3.93,1.09,3.63,10151,13.5,1.81,2.61,20,96,2.53,2.61,.28,1.66,3.52,1.12,3.82,845

1,13.05,2.05,3.22,25,124,2.63,2.68,.47,1.92,3.58,1.13,3.2,830

7 / 16

Page 8: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

1 5 10 30 1.0 0.2 2 12 1.5

1114

14

ash

1.53.0

1025

80160

1.03.5

14

0.2

0.53.5

28

hue

0.61.6

1.54.0

11 1.5 80 1 5 0.5 0.6 400

400

8 / 16

Page 9: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

alcohol

1 2 3 4 5 6 10 15 20 25 30

1112

1314

12

34

56

malic acid

ash

1.52.0

2.53.0

11 12 13 14

1015

2025

30

1.5 2.5

alcalinity of ash

9 / 16

Page 10: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

Some of the handwriting data

10 / 16

Page 11: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

Some basic definitionsI input X (features), output Y (response)I data (xi , yi), i = 1, . . .NI use data to assign a rule (function) taking X → YI goal is to predict a new value of Y , given X : Y (X )

I regression if Y is continuousclassification if Y is discrete

I Examples...

11 / 16

Page 12: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

How to know if the rule works?I compare yi to yi on data training error

I compare Y to Y on new data test error

I In ordinary linear regression, the least squares estimatesminimize

Σ(yi − xTi β)2

I and the minimized value is the sum of the squaredresiduals

Σi(yi − yi)2 = Σi(yi − xT

i β)2

I test error isΣj(Yj − xT

j β)2

12 / 16

Page 13: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

... training and test errorI In classification: use training data (xi , yi) to estimate a rule

G(·)I G(x) takes a new observation and assigns a category

(target) g ∈ G for the corresponding new yI for example, the simplest version of linear discriminant

analysis saysyj = 1 if xT

j β > c else y = 0I what criterion function does this minimize?I test error is proportion of mis-classifications in the test data

setI in regression or in linear discriminant analysis, training

error is always smaller than test error: why?

13 / 16

Page 14: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

Linear regression with polynomialsI model yi = β0 + β1xi + β2x2

i + β3x3i + · · ·+ εi

I assume εi ∼ N(0, σ2) (independent)I how to choose degree of polynomial?I for fixed degree, find β by least squares

0.0 0.2 0.4 0.6 0.8 1.0

-1.0

-0.5

0.0

0.5

1.0

x

y

go to R 14 / 16

Page 15: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

For next class/weekI Thursday: Make sure you can start RI Thursday: Send me email if you want a Cquest accountI Read Chapter 1, 2.1–2.3 of HTF

www.utstat.utoronto.ca/reid/sta414/ESLII-2.pdf

I Read Chapter 1 of Supercrunchers (handout)I Read “Data mining research: current status and future

opportunities” (handout)I We will cover Ch. 3.1-3.3/4 of HTF

15 / 16

Page 16: Data mining/machine learning - utstat.utoronto.cautstat.utoronto.ca/reid/sta414/2008-9/jan6.pdf · Data mining/machine learning I large data sets I high dimensional spaces I potentially

Some referencesI Bishop, C. (2006). Pattern Recognition and Machine

Learning. Springer-Verlag. research.microsoft.com/en-us/

um/people/cmbishop/PRML/index.htm

I Hand, D., Mannila, H. and Smyth, P. (2001). Principals ofData Mining. MIT Press.

I www.stat.columbia.edu/˜madigan/DM08David Madigan’s DM course (with links to other courses)

I www-stat.stanford.edu/˜tibs/stat315a.htmlRob Tibshirani’s Statistical Learning Course

go to web site

I Zhu, Mu (2008). Kernels and ensembles: perspectives onstatistical learning. American Statistician 62, 97–109.

I http://archive.ics.uci.edu/ml/UC Irvine machine learning repository of data sets

I Haughton, D. and 5 others (2003). A review of softwarepackages for data mining. American Statistician, 57,290–309. 16 / 16