Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
IntroductionExamples
574M: Introduction to Statistical MachineLearning
Hao Helen Zhang
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 1 / 26
IntroductionExamples
Course Information
Course homepage:http://www.math.arizona.edu/∼hzhang/math574m.html
Prerequisite:
MATH 464, MATH 466, or equivalentProgramming language: R (recommended)
Textbooks: The Elements of Statistical Learning (electronicversion available at course website)
Reference Books:1 Principle and Theory for Data Mining and Machine Learning
by Clark, Forkoue, Zhang (2009)2 Pattern Recognition and Neural Networks by B. Ripley (1996)3 Learning with Kernels by Scholkopf and Smola (2000)4 Nature of Statistical Learning Theory by Vapnik (1998)
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 2 / 26
IntroductionExamples
Big Data Features and Challenges
Big data are featured with
massive volume, ultra-high dimension
complex structure, heterogeneous
multiple-source, multiple-type data
Big Data Challenges:
data storage, transfer [enormous memory, parallel processing],search, retrieval [at a high speed]
data cleaning, organization, visualization [present/display datain a meaningful way]
data analysis [gain deep insight about data]
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 3 / 26
IntroductionExamples
Data Scientists
Data Science is an interdisciplinary field:
Data Analyst: visualize, analyze data, optimization
Data Engineers: build and test scalable/stable/optimal BigData ecosystems for data scientists to run their algorithms onthe data systems
Database Administrator: responsible for the properfunctioning of all the databases
Data Scientist: perform predictive analysis and offeractionable insights.
Statistician: extract and offer valuable insights from the datausing statistical theory and tools
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 4 / 26
IntroductionExamples
Statistical Machine Learning (ML)
Plays a central role in data mining.
extract patterns from data using statistical and probabilistictheory and tools
provide theoretical foundations for learning algorithms
develop useful tools to analyze an algorithm’sstatistical properties and performance guarantee
help researchers gain deeper understanding of the approaches,design better algorithms, and select appropriate methods for agiven problem
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 5 / 26
IntroductionExamples
What is Special about this course
To combine the “art” of designing good learning algorithms andthe “science” of analyzing statistical properties and performance ofthe approaches
emphasize statistical principles behind the approaches andalgorithms
learn how to formulate a learning problem in a statisticalframework
understand existing techniques from a statistical perspective:what, why, and when
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 6 / 26
IntroductionExamples
New Challenges to Statistical ML
Data Complexity: involves many variables which are oftenrelated in complex (nonlinear) ways.
Feature Selection: many features are available but some areredundant, leading to the feature selection or dimensionreduction problem.
Optimization: many methods involve finding the “best”parameters values by solving complex and large (containingmany parameters) optimization problems. Therefore, efficientoptimization techniques are required.
Visualization: much harder in the high dimensional space.
This is the so-called curse of dimensionality.
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 7 / 26
IntroductionExamples
Course Goals
understand different types of data mining problems
supervised, unsupervised, semi-supervised learningregression, classification, clustering, graphical modelstext mining, multimedia mining, image recognition, socialnetworks, recommendation system, network models
learn basic statistical concepts and principles (which arefavored over a black-box learning algorithm)
statistical inference on uncertainty, distribution, loss, riskmodel building, evaluation, selection, prediction, andoptimality; bias-variance tradefoffparameter tuning, training error, test error, cross-validation,
learn statistical and machine learning methods for big data
SVM, PCA, lasso, boosting, tree, random forest
learn R software packages to analyze data
take into serious consideration scalable, parallel algorithms
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 8 / 26
IntroductionExamples
Terminologies
Statistics Machine Learning
X predictor, covariate input, feature
Y response, outcome output
{(Xi ,Yi )}ni=1 data points, samples examples, training data
f̂ (X) model fitting machine learning
data analysis consistency, inference prediction, speedgoals confidence interval risk bound
convergence rate learning theory
Practical vs Reluctant to use methods Willing to use ad hocTheoretical methods without theoretical methods if they seemsConcerns justification (even if the to work well (though
justification is appearances may beactually meaningless) misleading)
Computing more and more important heavy
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 9 / 26
IntroductionExamples
Different Learning Problems
Typically, we collect data (Xi ,Yi ), i = 1, · · · , n.
Yi is the outcome or response variable.
Xi is the input or prediction variables.
Various Machine Learning Problems
Supervised learning (Y observed)
Unsupervised learning (Y unobserved)
Semi-supervised learning (Y partially available)
Various Statistical problems
Regression (Y quantitative, a numerical quantity)
Classification (Y qualitative; a class label)
Density estimation (no Y )
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 10 / 26
IntroductionExamples
Supervised Learning vs Unsupervised Learning
Supervised Learning Unsupervised Learning
Response Y observed unobserved
Major predict Y , given find interestingGoal the observed input X patterns in data
Examples linear regression clusteringnonparametric regression density estimation
classification dimension reduction
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 11 / 26
IntroductionExamples
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 12 / 26
IntroductionExamples
Which Type of Learning Problems
Predict whether a patient, hospitalized due to a heart attack,will have a second heart attack, based on demographic, dietand clinical measurements
Predict a stock price based on company performance andeconomic data
Identify a handwritten ZIP code, from a digitized image
Classify a recorded phoneme based on a log-periodogram
Identify cancer risk factors based on clinical, genetic anddemographic variables
Identify groups of genes with similar functions from DNAmicroarray data
Establish the relationship between salary and demographicvariables in population survey.
In fraud detection, predict the probability of a fraud order
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 13 / 26
IntroductionExamples
Example 1: Email or Spam (Textbook Page 2)
Training data: 4, 601 email messages with known email type.
Input X: the relative frequencies of 57 of the most commonlyoccurring words and punctuation marks in the email message.Outcome Y : −1 =email , + =spam
Goal: Design an automatic spam detector to predict whethera message is email or spam. In the future, the detector can beused to filter out the junk emails before clogging the users’mailbox.
Supervised learning, binary classification, n > p. (data matrixn × p “long and narrow”)
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 14 / 26
IntroductionExamples
Table: Average percentage of words (the largest difference between spam and email
george you your hp free hpl ! our re edu removespam 0.00 2.26 1.38 0.02 0.52 0.01 0.51 0.51 0.13 0.01 0.28email 1.27 1.27 0.44 0.90 0.07 0.43 0.11 0.18 0.42 0.29 0.01
Possible classification rules:
if (george < 0.6) & (you > 1.5), then spam; otherwise email.
if (0.2 you -0.3 george)> 0, then spam; otherwise email.
Two types of decision errors:
(1) false positive: classify email to spam (filter out email)
(2) false negative: classify spam to email (email box jammed)
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 15 / 26
IntroductionExamples
Example 2: Prostate Cancer
Read Textbook page 3
Training data: 97 male patient with different stages ofprostate cancer.
Input X: eight clinical measures: log-cancer volume (lcavol),log prostate weight (lweight), age, and the other five.
Outcome Y : the log of the level of prostate specific antigen(lpsa)
Questions of interest:
What is the relationship between lpsa and clinical measures?
Is the linear model sufficient? Nonlinear effect? Interactions?
Which clinical measures are more relevant to the prediction?
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 16 / 26
IntroductionExamplesElements of Statisti al Learning Hastie, Tibshirani & Friedman 2001 Chapter 1
lpsa
-1 1 3
oooooo ooo ooo ooooo oo o o oo oo ooo o ooo ooo ooo ooo oo ooo oo ooo oo o ooooooo o oooo ooo o ooo o ooooooo oooo ooo oo ooo
ooo o
o oo ooooooooooo oo ooo ooooooo oooo o oooo ooo ooo ooooo oooo ooo oo oo o oo ooo ooooooooo oo oo ooooo o oooooo oo ooo
oooo40 60 80
o o oooo ooo oooo oo o ooo oooo o oooooooo oo oo oo oo o ooo oo ooo oooo oo ooo oo oo oooo o oooo ooo oo o ooooo ooo oooo ooo oo o oo
oooooo o oooo ooooo ooo oo oo o ooo ooo ooooo ooo oooo oo oo ooo oo oooo oo oo o ooo oo o o oo oo oo oo o oooo ooo o oo ooo oo oooo oo
0.0 0.4 0.8
oooooooooooooooooooooooooooooooooooooo oooooooo ooooooooooooooo oo ooooooo oo oooooo oooo ooo oo oooo oo
oooo
oooooooooooo oo oo o oooo oo ooo oo o ooooo oo oo oooo o ooo ooo ooo o ooo ooooo oo oo o ooo o oo o o o ooo ooo oo oo oo o oooo o ooo o
6.0 7.5 9.0
oo oooooooooo oooo ooooo oo ooo ooooooooo o oooo ooo ooo oooo oo oooooo ooo o ooo oooo oooo oooooooooo oooo oooo oo
oooo
0123
45
oo oooooooooo oooo ooooo oo ooo oo oooooooooo oo oooo oo oooo oo oooooo o oo o ooo o oooo ooo oo ooo ooo oo oo oo o o oo o ooo oo
-10
12
34
oooo
o
o
oo
ooo
o
oooo
o
o
oooo
o
o
oooo
o
o
oo
o
oo
ooo
o
oooo
oooo
ooooo
o
oo
oooooo
ooooooo
o
ooooo
oo
oooo
oooo
o
o
oo
o
o
ooo
oooo
lcavol
o oo
o
o
o
oo
ooo
o
oo oo
o
o
oo
oo
o
o
oo
oo
o
o
o o
o
oo
ooo
o
oooo
oooo
ooo o
o
o
oo
ooo oo
o
oo
ooooo
o
oo
o oo
oo
oooo
oo oo
o
o
oo
o
o
oooo
ooo
o oo
o
o
o
oo
ooo
o
o oo o
o
o
oo
oo
o
o
oo
oo
o
o
oo
o
o o
o oo
o
o oo
o
o ooo
oo
ooo
o
oo
ooo o
oo
oo
ooo
o o
o
oo
ooo
oo
oooo
oo oo
o
o
oo
o
o
oo o
oo o
o
oooo
o
o
o o
ooo
o
oooo
o
o
oo
oo
o
o
oo
oo
o
o
oo
o
oo
ooo
o
ooo
o
o ooo
oo
ooo
o
oo
ooooo
o
oo
ooo
o o
o
oo
ooo
o o
oooo
ooo o
o
o
oo
o
o
ooo
oo o
o
oooo
o
o
oo
ooo
o
oooo
o
o
oooo
o
o
oooo
o
o
oo
o
oo
ooo
o
oooo
oooo
ooooo
o
oo
oooooo
oo
ooooo
o
oo
o oo
oo
oo oo
oo oo
o
o
o o
o
o
ooooooo
oooo
o
o
oo
ooo
o
oo oo
o
o
ooo
o
o
o
oo
oo
o
o
oo
o
o o
o oo
o
ooo
o
oooo
oo
ooo
o
oo
o ooooo
oo
ooo
oo
o
oo
o oo
o o
oo oo
ooo o
o
o
o o
o
o
oo o
ooo
o
ooo
o
o
o
oo
ooo
o
oooo
o
o
ooo
o
o
o
oo
oo
o
o
oo
o
oo
o oo
o
o oo
o
oooo
ooo oo
o
oo
ooo o
oo
oo
ooooo
o
oooo
o
oo
oooo
ooo o
o
o
o o
o
o
ooooooo
ooo
o
o
o
oo
ooo
o
oooo
o
o
ooo
o
o
o
oo
oo
o
o
oo
o
oo
ooo
o
o oo
o
o ooo
ooo oo
o
oo
ooo o
oo
oo
ooo
oo
o
oo
ooo
o o
oo oo
ooo o
o
o
o o
o
o
oo o
oo o
o
oooooooooo
ooooooooooooooooooooo
o
ooo
oo
o
o
ooooooooooooooooo
o
oooo
ooooooooo
oooooo
oooooooooooo
o
oooooo
oo
oo
oo oo ooo o
oooo
oo
o oo
oo oo ooo
ooo o
o
o
ooo
oo
o
o
oooo ooo
oooo
o oo
oo
o
o
oo
oo
o oooo oo
oo
ooo
ooo
oooo
ooooo oo
o
o
oo
oo ooo o lweight
oo
oooo ooo o
ooo o
oo
ooo
oooo o o
ooooo
o
o
oo o
oo
o
o
o oooo
oooo
o ooo
oo
oo
o
oo
oo
o oooo o ooo
o oo
ooo
o ooo
oo
ooo ooo
o
oo
o ooo
oo
oooooo o oooo ooooo
ooo
oo oo o o
oo o
ooo
o
ooo
oo
o
o
ooo oo
oooo
o oo o
oo
oo
o
oo
oo
oooo o o o
oo
o oo
oo
o
oooo
oo
o o oo oo
o
oo
oooo
oo
ooooooooooooooooooooooooooooooo
o
ooo
oo
o
o
ooooooo
oooooooooo
o
oooo
oooooo
ooo
oooooo
oooo
oo
ooo ooo
o
oo
oooooo
oooooooooooo
ooo
oo o
ooo oo ooo
ooo o
o
o
oo o
oo
o
o
ooo o o
oooo
o ooo
oo
oo
o
oooo
oooo o oo
oo
ooo
oo
o
oooo
oo
o oo ooo
o
ooo o o
oo o
oo
oooooooooo
ooo
ooo
ooo oo ooo
oooo
o
o
ooo
oo
o
o
o ooo ooo
oooo
oooooo
o
ooo
o
o ooo ooo
oo
ooo
ooo
ooooooo ooooo
o
oo
oooooo
34
56
oo
oooooooooo
ooo
ooo
ooo oo ooo
oooo
o
o
ooo
oo
o
o
o oooo
oooo
oooo
oo
oo
o
oo
oo
o ooo o oo
oo
ooo
oo
o
oooo
oo
o oo ooo
o
oo
o ooo
oo40
5060
7080
o
o
o
oo
o
oo
o
ooooo
o
ooo
o
o
ooooooooooooo
o
oo
o
oo
oo
ooooo
o
o
o
ooooo
oo
oo
o
o
o
o
oooooooo
o
o
o
o
oooo
ooo
o
o
oooooo
o
ooo
oo
oo
o
o
o
oo
o
oo
o
ooo oo
o
oo
o
o
o
o ooo
o ooo o oooo
o
o o
o
oo
oo
oo o
oo
o
o
o
oooo
o
oo
oo
o
o
o
o
ooooooo o
o
o
o
o
ooo
o
ooo
o
o
oo
oo
o o
o
ooo
oo
o o
o
o
o
oo
o
oo
o
ooooo
o
ooo
o
o
oooooo oooo o o
o
o
o o
o
oo
oo
ooo
oo
o
o
o
o oooo
oo
o o
o
o
o
o
oooooooo
o
o
o
o
oooo
ooo
o
o
oo
oo
o o
o
ooo
oo
oo
ageo
o
o
oo
o
oo
o
oo ooo
o
oo
o
o
o
o ooo
ooo ooo ooo
o
o o
o
oo
oo
oooo
o
o
o
o
oooo
o
oo
o o
o
o
o
o
ooo o
ooo o
o
o
o
o
o ooo
oo o
o
o
oo
oo
oo
o
oo
o
oo
oo
o
o
o
oo
o
oo
o
ooooo
o
ooo
o
o
ooooooooooooo
o
oo
o
oo
oo
ooooo
o
o
o
ooooo
oo
oo
o
o
o
o
oooo
oooo
o
o
o
o
oooo
ooo
o
o
oo
oo
oo
o
ooo
oo
oo
o
o
o
oo
o
oo
o
ooo oo
o
oo
o
o
o
o ooo
oo oo o oooo
o
oo
o
oo
oo
oo oo
o
o
o
o
ooo o
o
oo
oo
o
o
o
o
ooo o
ooo o
o
o
o
o
o oo
o
ooo
o
o
oo
oo
o o
o
oo
o
oo
o o
o
o
o
oo
o
oo
o
ooo oo
o
oo
o
o
o
o ooo
oo ooooooo
o
o o
o
oo
oo
oo o
oo
o
o
o
o oo o
o
oo
oo
o
o
o
o
ooo oooo o
o
o
o
o
oooo
ooo
o
o
ooo
ooo
o
ooo
oo
oo
o
o
o
oo
o
oo
o
ooo oo
o
oo
o
o
o
o ooo
oo oo ooooo
o
oo
o
oo
oo
ooo
oo
o
o
o
o oo o
o
oo
oo
o
o
o
o
ooo o
oooo
o
o
o
o
o oo
o
ooo
o
o
oo
oo
o o
o
oo
o
oo
oo
oooooo
o
o
ooo
o
oooo
o
oo
o
o
o
o
o
o
o
o
o
o
o
ooo
oo
o
o
o
oo
oo
o
o
o
o
oo
o
o
o
ooo
o
o
o
o
o
o
oo
o
o
o
ooo
o
o
o
o
o
o
o
o
oo
oo
oo
o
oo
o
o
oo
o
o
o
ooo
o
o
oooo oo
o
o
o oo
o
oooo
o
oo
o
o
o
o
o
o
o
o
o
o
o
ooo
oo
o
o
o
oo
o o
o
o
o
o
oo
o
o
o
oo o
o
o
o
o
o
o
oo
o
o
o
ooo
o
o
o
o
o
o
o
o
oo
oo
o o
o
oo
o
o
o o
o
o
o
o oo
o
o
o oo ooo
o
o
ooo
o
oo oo
o
oo
o
o
o
o
o
o
o
o
o
o
o
o oo
oo
o
o
o
oo
o o
o
o
o
o
oo
o
o
o
oo o
o
o
o
o
o
o
oo
o
o
o
ooo
o
o
o
o
o
o
o
o
oo
oo
o o
o
oo
o
o
o o
o
o
o
ooo
o
o
o o oooo
o
o
o oo
o
o oo o
o
oo
o
o
o
o
o
o
o
o
o
o
o
ooo
o o
o
o
o
oo
o o
o
o
o
o
oo
o
o
o
oo o
o
o
o
o
o
o
oo
o
o
o
oo
o
o
o
o
o
o
o
o
o
o o
oo
oo
o
oo
o
o
oo
o
o
o
oo o
o
o lbph
oooooo
o
o
ooo
o
oooo
o
oo
o
o
o
o
o
o
o
o
o
o
o
ooo
oo
o
o
o
oo
oo
o
o
o
o
oo
o
o
o
ooo
o
o
o
o
o
o
oo
o
o
o
ooo
o
o
o
o
o
o
o
o
oo
oo
oo
o
oo
o
o
oo
o
o
o
ooo
o
o
oooooo
o
o
ooo
o
oo oo
o
oo
o
o
o
o
o
o
o
o
o
o
o
ooo
o o
o
o
o
oo
oo
o
o
o
o
oo
o
o
o
oo o
o
o
o
o
o
o
oo
o
o
o
oo
o
o
o
o
o
o
o
o
o
oo
oo
o o
o
oo
o
o
o o
o
o
o
o oo
o
o
oo oooo
o
o
ooo
o
oooo
o
oo
o
o
o
o
o
o
o
o
o
o
o
ooo
oo
o
o
o
oo
oo
o
o
o
o
oo
o
o
o
ooo
o
o
o
o
o
o
oo
o
o
o
ooo
o
o
o
o
o
o
o
o
oo
oo
oo
o
oo
o
o
oo
o
o
o
ooo
o
o
-10
12
oo oooo
o
o
ooo
o
oooo
o
oo
o
o
o
o
o
o
o
o
o
o
o
ooo
oo
o
o
o
oo
oo
o
o
o
o
oo
o
o
o
ooo
o
o
o
o
o
o
oo
o
o
o
oo
o
o
o
o
o
o
o
o
o
oo
oo
o o
o
oo
o
o
o o
o
o
o
ooo
o
o
0.00.4
0.8
oooooooooooooooooooooooooooooooooooooo
o
ooooooo
o
oooooooooooooo
o
o
o
oooooo
o
o
oooo
oo
o
ooo
o
oo
o
o
ooo
o
oooooo
oooo oo ooo ooo ooooo oo o o oo oo ooo o ooo ooo ooo
o
oo oo ooo
o
o ooo oo o ooooooo
o
o
o
oo ooo o
o
o
o o oo
oo
o
oo o
o
oo
o
o
o oo
o
oo ooo o
o oo ooooooooooo oo ooo ooooooo oooo o oooo ooo
o
oo ooooo
o
ooo ooo oo oo o oo o
o
o
o
oooooo
o
o
oo oo
oo
o
oo o
o
oo
o
o
o oo
o
oooooo
o o oooo ooo oooo oo o ooo oooo o oooooooo oo oo oo
o
o o ooo oo
o
oo oooo oo ooo oo o
o
o
o
oo o ooo
o
o
oo oo
o o
o
ooo
o
oo
o
o
oo o
o
o oo o oo
oooooo o oooo ooooo ooo oo oo o ooo ooo ooooo ooo
o
ooo oo oo
o
oo oo oooo oo oo o o
o
o
o
o o o oo o
o
o
o oo o
oo
o
o oo
o
o o
o
o
oo o
o
oooo oo
svi
oooooooooooo oo oo o oooo oo ooo oo o ooooo oo oo
o
ooo o ooo
o
oo ooo o ooo ooooo
o
o
o
o o ooo o
o
o
o o o o
oo
o
oo o
o
oo
o
o
o oo
o
o o ooo o
oo oooooooooo oooo ooooo oo ooo ooooooooo o oo
o
o ooo ooo
o
ooo oo oooooo ooo
o
o
o
o oooo o
o
o
o ooo
oo
o
ooo
o
oo
o
o
ooo
o
oooooo
oo oooooooooo oooo ooooo oo ooo oo oooooooooo
o
o oooo oo
o
ooo oo oooooo o oo
o
o
o
o o oooo
o
o
o oo o
oo
o
oo o
o
oo
o
o
o o o
o
o ooo oo
oooooooooooooo
o
oo
o
ooo
o
o
o
oooo
o
o
oooooo
o
o
o
o
oo
o
o
o
o
o
o
oooo
o
o
oo
o
oooo
ooo
o
o
o
o
oo
o
o
o
ooo
o
o
o
oo
o
o
o
o
o
o
o
o
o
oo
ooo
o
o
oooo oo ooo ooooo
o
oo
o
o o o
o
o
o
o oo
o
o
o
oo ooo
o
o
o
o
o
o o
o
o
o
o
o
o
ooo o
o
o
oo
o
oooo
o oo
o
o
o
o
oo
o
o
o
ooo
o
o
o
oo
o
o
o
o
o
o
o
o
o
oo
ooo
o
o
o oo ooooooooooo
o
oo
o
o oo
o
o
o
oooo
o
o
oooo
oo
o
o
o
o
o o
o
o
o
o
o
o
oo
oo
o
o
o o
o
o oo o
ooo
o
o
o
o
oo
o
o
o
oo
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
oo
ooo
o
o
o o oooo ooo oooo
o
o
oo
o
o oo
o
o
o
oooo
o
o
oo oo
oo
o
o
o
o
o o
o
o
o
o
o
o
oo
oo
o
o
o o
o
o oo o
o oo
o
o
o
o
oo
o
o
o
ooo
o
o
o
oo
o
o
o
o
o
o
o
o
o
oo
oo o
o
o
oooooo o oooo ooo
o
oo
o
o oo
o
o
o
ooo
o
o
o
oooo
oo
o
o
o
o
oo
o
o
o
o
o
o
oo
o o
o
o
o o
o
oo o o
ooo
o
o
o
o
oo
o
o
o
oo
o
o
o
o
o o
o
o
o
o
o
o
o
o
o
o o
ooo
o
o
oooooooooooooo
o
oo
o
ooo
o
o
o
oooo
o
o
oooooo
o
o
o
o
oo
o
o
o
o
o
o
oooo
o
o
oo
o
oooo
ooo
o
o
o
o
oo
o
o
o
ooo
o
o
o
oo
o
o
o
o
o
o
o
o
o
o o
ooo
o
o
lcp
oo oooooooooooo
o
oo
o
ooo
o
o
o
oooo
o
o
ooooo
o
o
o
o
o
oo
o
o
o
o
o
o
oo
oo
o
o
oo
o
o ooo
o oo
o
o
o
o
oo
o
o
o
ooo
o
o
o
oo
o
o
o
o
o
o
o
o
o
o o
ooo
o
o
-10
12
3
oo ooooooooooo
o
o
oo
o
ooo
o
o
o
ooo
o
o
o
oooooo
o
o
o
o
oo
o
o
o
o
o
o
oo
oo
o
o
oo
o
o o oo
o oo
o
o
o
o
oo
o
o
o
oo
o
o
o
o
oo
o
o
o
o
o
o
o
o
o
o o
ooo
o
o
6.07.0
8.09.0
oo
o
ooooooooo
ooo
o
o
oooo
o
o
o
oo
ooo
oooooo
o
o
ooo
o
o
o
ooo
o
o
oo
o
o
ooooo
o
oo
o
o
o
o
o
ooo
o
oooo
o
ooooooooo
o
oo
o
ooo
o
oooooo
oo
o
o oo ooo ooo
ooo
o
o
oo o o
o
o
o
o o
oo o
ooo ooo
o
o
o oo
o
o
o
ooo
o
o
oo
o
o
o o ooo
o
oo
o
o
o
o
o
o oo
o
o ooo
o
ooooooo oo
o
o o
o
o oo
o
oo ooo o
o o
o
ooooooooo
oo o
o
o
oo oo
o
o
o
oo
ooo
o o oooo
o
o
o oo
o
o
o
ooo
o
o
oo
o
o
o oo oo
o
oo
o
o
o
o
o
ooo
o
ooo o
o
oo ooooo o o
o
oo
o
o oo
o
oooooo
o o
o
ooo ooo ooo
o oo
o
o
oo oo
o
o
o
oo
ooo
ooo oo o
o
o
o oo
o
o
o
o oo
o
o
o o
o
o
o oo oo
o
oo
o
o
o
o
o
o o o
o
oo oo
o
oo o ooooo o
o
o o
o
oo o
o
o oo o oo
oo
o
ooo o oooo o
ooo
o
o
oo oo
o
o
o
oo
o oo
o ooooo
o
o
o oo
o
o
o
o oo
o
o
o o
o
o
ooo oo
o
o o
o
o
o
o
o
o o o
o
oo oo
o
o o oooo ooo
o
oo
o
oo o
o
oooo oo
oo
o
ooooooooo
ooo
o
o
oooo
o
o
o
oo
ooo
oooooo
o
o
o oo
o
o
o
ooo
o
o
oo
o
o
ooooo
o
oo
o
o
o
o
o
ooo
o
o oo o
o
oooo oooo o
o
o o
o
ooo
o
oooooo
oo
o
ooooooooo
oo o
o
o
oooo
o
o
o
oo
oo o
ooooo o
o
o
o oo
o
o
o
ooo
o
o
o o
o
o
o ooo o
o
oo
o
o
o
o
o
o oo
o
o oo o
o
o ooo ooo oo
o
o o
o
o oo
o
o o ooo ogleason
oo
o
ooooooooo
ooo
o
o
oooo
o
o
o
oo
oo o
oooooo
o
o
o oo
o
o
o
o oo
o
o
oo
o
o
ooooo
o
o o
o
o
o
o
o
o oo
o
o ooo
o
o ooo ooo oo
o
o o
o
o o o
o
o ooo oo
0 2 4oooooooooooo
o
ooo
o
oooooo
o
oo
o
o
o
oooooooooo
o
o
ooooo
o
o
oo
o
o
o
o
oooooo
o
o
o
o
oo
o
ooo
o
oo
o
o
oo
o
oooo
o
o
o
o
o
o
oo
oo
oo
o
o
oooo
o oo ooo ooo
o
ooo
o
oo o oo
o
o
o o
o
o
o
ooo ooo ooo
o
o
o
oo o
oo
o
o
oo
o
o
o
o
oo
ooo
o
o
o
o
o
oo
o
oo o
o
oo
o
o
oo
o
oo
oo
o
o
o
o
o
o
oo
oo
oo
o
o
o
3 4 5 6o oo
ooooooooo
o
o oo
o
oo oooo
o
oo
o
o
o
o o oooo ooo
o
o
o
ooo
oo
o
o
oo
o
o
o
o
oo
o oo
o
o
o
o
o
oo
o
ooo
o
o o
o
o
oo
o
oo
oo
o
o
o
o
o
o
oo
oo
oo
o
o
oo o
oooo ooo ooo
o
oo o
o
oo oooo
o
oo
o
o
o
ooo oo oooo
o
o
o
ooo
oo
o
o
o o
o
o
o
o
oo
ooo
o
o
o
o
o
oo
o
ooo
o
oo
o
o
oo
o
ooo
o
o
o
o
o
o
o
oo
oo
oo
o
o
o
-1 0 1 2oooooo o oooo o
o
ooo
o
oo ooo
o
o
oo
o
o
o
o ooooo ooo
o
o
o
ooooo
o
o
o o
o
o
o
o
oo
o oo
o
o
o
o
o
oo
o
oo o
o
oo
o
o
oo
o
oo
oo
o
o
o
o
o
o
oo
oo
oo
o
o
ooooooooooooo
o
ooo
o
oooooo
o
oo
o
o
o
ooooooooo
o
o
o
ooooo
o
o
oo
o
o
o
o
oooooo
o
o
o
o
oo
o
ooo
o
o o
o
o
oo
o
oooo
o
o
o
o
o
o
oo
oo
oo
o
o
o
-1 1 2 3oooooooooooo
o
o oo
o
ooooo
o
o
oo
o
o
o
ooooo ooooo
o
o
oo ooo
o
o
o o
o
o
o
o
oo
oooo
o
o
o
o
oo
o
oo o
o
o o
o
o
oo
o
ooo
o
o
o
o
o
o
o
oo
oo
oo
o
o
ooo
oooooooooo
o
ooo
o
ooooo
o
o
oo
o
o
o
oooooo ooo
o
o
o
oo o
oo
o
o
oo
o
o
o
o
oooooo
o
o
o
o
oo
o
oo o
o
oo
o
o
oo
o
oooo
o
o
o
o
o
o
oo
oo
oo
o
o
o
0 40 80
020
60100
pgg45
Figure 1.1: S atterplot matrix of the prostate an erdata. The �rst row shows the response against ea h ofthe predi tors in turn. Two of the predi tors, svi andgleason, are ategori al.
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 17 / 26
IntroductionExamples
Elements of Statisti al Learning Hastie, Tibshirani & Friedman 2001 Chapter 3
Shrinkage Factor s
Coeffi
cients
0.0 0.2 0.4 0.6 0.8 1.0
-0.20.0
0.20.4
0.6
•
•
•
•
•
•
••
•• • • • • • • • • • • • • • • • lcavol
• • • • ••
••
•• • • • • • • • • • • • • • • • lweight
• • • • • • • • • • • • • ••
• • • • • • • • • •age
• • • • • • • • • ••
••
•• • • • • • • • • • • lbph
• • • • • • ••
••
••
•• • • • • • • • • • • •svi
• • • • • • • • • • • • • • ••
••
••
••
••
• lcp
• • • • • • • • • • • • • • • • • • • • • • • • •gleason• • • • • • • • • •
••
•• • • • • • • • • •
••pgg45
Figure 3.9: Pro�les of lasso oeÆ ients, as tuningparameter t is varied. CoeÆ ients are plotted versuss = t=Pp1 j�̂j j. A verti al line is drawn at s = 0:5, thevalue hosen by ross-validation. Compare Figure 3.7on page 7; the lasso pro�les hit zero, while those forridge do not.
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 17 / 26
IntroductionExamples
Handwritten Digit Recognition
Handwritten Digit Recognition (Textbook page 4)
Goal: identify single digits 0 ∼ 9 based on the images.
Raw Data: images that are scaled segments from five digitZIP codes.
Each digit has an image of 16× 16 eight-bit gray scale mapsPixel intensities range from 0 (black) to 255 (white)Images are normalized to have approximately the same sizeand orientation
Input X is a 16× 16 matrix, or a 256 dimensional vector.
Output G ∈ G = {0, 1, ..., 9}.The error rate should be very low to avoid misdirection of mail.Some objects are assigned to “do not know” category and sortedby hand.
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 18 / 26
IntroductionExamples
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 19 / 26
IntroductionExamples
Example 4: DNA Expression Microarray
DNA = Deoxy-ribo-nucleic acid, a basic material making uphuman chromosomes.DNA Microarray Technique: for each sample from a tissue, theexpression level (the amount of mRNA) of thousands of genes aremeasured.
Training data: p = 6, 830 genes (rows), n = 64 samples(columns) (cancer tumors) taken from two classes.
Input X: the level of expression for each geneGoal: discover the relationship between gene and cancer type,or find the gene signature of each cancer subtype
Challenge: p >> n (data matrix n × p “short and fat”’)
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 20 / 26
IntroductionExamples
How DNA Microarray Technique Works
A breakthrough technology in biology, facilitating the quantitativestudy of thousands of genes simultaneously from a single sample ofcells
1 The nucleotide sequences for a few thousand genes are printedon a glass slide;
2 A target sample and a reference sample are labeled with redand green dyes, and each are hybridized with the DNA on theslide;
3 Through fluroscopy, the log (red/green) intensities of RNAhybridizing each site is measured;
4 The results are a few thousand numbers, typically rangingfrom -6 to 6, measuring the expression level of each gene inthe target relative to the reference sample (positive valuesindicate higher expression in the target versus the reference,and vice versa for negative values).
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 21 / 26
IntroductionExamplesElements of Statisti al Learning Hastie, Tibshirani & Friedman 2001 Chapter 1
SID42354SID31984SID301902SIDW128368SID375990SID360097SIDW325120ESTsChr.10SIDW365099SID377133SID381508SIDW308182SID380265SIDW321925ESTsChr.15SIDW362471SIDW417270SIDW298052SID381079SIDW428642TUPLE1TUP1ERLUMENSIDW416621SID43609ESTsSID52979SIDW357197SIDW366311ESTsSMALLNUCSIDW486740ESTsSID297905SID485148SID284853ESTsChr.15SID200394SIDW322806ESTsChr.2SIDW257915SID46536SIDW488221ESTsChr.5SID280066SIDW376394ESTsChr.15SIDW321854WASWiskottHYPOTHETICALSIDW376776SIDW205716SID239012SIDW203464HLACLASSISIDW510534SIDW279664SIDW201620SID297117SID377419SID114241ESTsCh31SIDW376928SIDW310141SIDW298203PTPRCSID289414SID127504ESTsChr.3SID305167SID488017SIDW296310ESTsChr.6SID47116MITOCHONDRIAL60ChrSIDW376586HomosapiensSIDW487261SIDW470459SID167117SIDW31489SID375812DNAPOLYMERSID377451ESTsChr.1MYBPROTOSID471915ESTsSIDW469884HumanmRNASIDW377402ESTsSID207172RASGTPASESID325394H.sapiensmRNAGNALSID73161SIDW380102SIDW299104
BREA
STRE
NAL
MELA
NOMA
MELA
NOMA
MCF7
D-rep
roCO
LON
COLO
NK5
62B-r
epro
COLO
NNS
CLC
LEUK
EMIA
RENA
LME
LANO
MABR
EAST CN
SCN
SRE
NAL
MCF7
A-rep
roNS
CLC
K562
A-rep
roCO
LON
CNS
NSCL
CNS
CLC
LEUK
EMIA
CNS
OVAR
IANBR
EAST
LEUK
EMIA
MELA
NOMA
MELA
NOMA
OVAR
IANOV
ARIAN
NSCL
CRE
NAL
BREA
STME
LANO
MAOV
ARIAN
OVAR
IANNS
CLC
RENA
LBR
EAST
MELA
NOMA
LEUK
EMIA
COLO
NBR
EAST
LEUK
EMIA
COLO
NCN
SME
LANO
MANS
CLC
PROS
TATE
NSCL
CRE
NAL
RENA
LNS
CLC
RENA
LLE
UKEM
IAOV
ARIAN
PROS
TATE
COLO
NBR
EAST
RENA
LUN
KNOW
N
Figure 1.3: DNA mi roarray data: expression matrix of6830 genes (rows) and 64 samples ( olumns), for the humantumor data. Only a random sample of 100 rows are shown.The display is a heat map, ranging from bright green (nega-tive, under expressed) to bright red (positive, over expressed).Missing values are gray. The rows and olumns are displayedin a randomly hosen order.Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 22 / 26
IntroductionExamples
Microarray Data Analysis
Typical questions of interest:
Which samples are most similar to each other, in terms oftheir expression profiles across genes?
Which genes are most similar to each other, in terms of theirexpression profiles across sample?
Do certain genes show very high (or low) expression forcertain cancer samples?
.......
This task could be viewed as
a supervised or an unsupervised problem
a regression or a classification problem
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 23 / 26
IntroductionExamples
Example: Direct Mailing Prediction
Interested in predicting how much money an individual willdonate?
Data: observations from 90,000 people on which we haverecorded over 400 different characteristics.
Decision: For a given individual should I send out a mailing?
Do not care too much about each individual characteristic.
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 24 / 26
IntroductionExamples
Statistical Problems in Data Mining
Regression (linear, nonlinear, logistic regression, GLM,graphical model)
Classification (Discriminant Analysis; LDA, QDA, SVM, trees,random forest)
Feature Selection (Variable Selection; Sparse Analysis;LASSO, screening)
Dimension Reduction (PCA; ICA; SDR)
Regularization (control model complexity, aim forparsimonious and interpretable solutions)
Clustering Analysis (k-means, association role)
Other Variations
Semi-supervised Learning
Mixture of the above
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 25 / 26
IntroductionExamples
Next Class
Introduction to R and RStudio
Please install RStudio on your laptop or PC
Please read one tutorial about R (available at our course page)
Statistical Decision Theory
Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 26 / 26