27
Introduction Examples 574M: Introduction to Statistical Machine Learning Hao Helen Zhang Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 1 / 26

574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

574M: Introduction to Statistical MachineLearning

Hao Helen Zhang

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 1 / 26

Page 2: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Course Information

Course homepage:http://www.math.arizona.edu/∼hzhang/math574m.html

Prerequisite:

MATH 464, MATH 466, or equivalentProgramming language: R (recommended)

Textbooks: The Elements of Statistical Learning (electronicversion available at course website)

Reference Books:1 Principle and Theory for Data Mining and Machine Learning

by Clark, Forkoue, Zhang (2009)2 Pattern Recognition and Neural Networks by B. Ripley (1996)3 Learning with Kernels by Scholkopf and Smola (2000)4 Nature of Statistical Learning Theory by Vapnik (1998)

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 2 / 26

Page 3: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Big Data Features and Challenges

Big data are featured with

massive volume, ultra-high dimension

complex structure, heterogeneous

multiple-source, multiple-type data

Big Data Challenges:

data storage, transfer [enormous memory, parallel processing],search, retrieval [at a high speed]

data cleaning, organization, visualization [present/display datain a meaningful way]

data analysis [gain deep insight about data]

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 3 / 26

Page 4: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Data Scientists

Data Science is an interdisciplinary field:

Data Analyst: visualize, analyze data, optimization

Data Engineers: build and test scalable/stable/optimal BigData ecosystems for data scientists to run their algorithms onthe data systems

Database Administrator: responsible for the properfunctioning of all the databases

Data Scientist: perform predictive analysis and offeractionable insights.

Statistician: extract and offer valuable insights from the datausing statistical theory and tools

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 4 / 26

Page 5: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Statistical Machine Learning (ML)

Plays a central role in data mining.

extract patterns from data using statistical and probabilistictheory and tools

provide theoretical foundations for learning algorithms

develop useful tools to analyze an algorithm’sstatistical properties and performance guarantee

help researchers gain deeper understanding of the approaches,design better algorithms, and select appropriate methods for agiven problem

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 5 / 26

Page 6: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

What is Special about this course

To combine the “art” of designing good learning algorithms andthe “science” of analyzing statistical properties and performance ofthe approaches

emphasize statistical principles behind the approaches andalgorithms

learn how to formulate a learning problem in a statisticalframework

understand existing techniques from a statistical perspective:what, why, and when

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 6 / 26

Page 7: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

New Challenges to Statistical ML

Data Complexity: involves many variables which are oftenrelated in complex (nonlinear) ways.

Feature Selection: many features are available but some areredundant, leading to the feature selection or dimensionreduction problem.

Optimization: many methods involve finding the “best”parameters values by solving complex and large (containingmany parameters) optimization problems. Therefore, efficientoptimization techniques are required.

Visualization: much harder in the high dimensional space.

This is the so-called curse of dimensionality.

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 7 / 26

Page 8: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Course Goals

understand different types of data mining problems

supervised, unsupervised, semi-supervised learningregression, classification, clustering, graphical modelstext mining, multimedia mining, image recognition, socialnetworks, recommendation system, network models

learn basic statistical concepts and principles (which arefavored over a black-box learning algorithm)

statistical inference on uncertainty, distribution, loss, riskmodel building, evaluation, selection, prediction, andoptimality; bias-variance tradefoffparameter tuning, training error, test error, cross-validation,

learn statistical and machine learning methods for big data

SVM, PCA, lasso, boosting, tree, random forest

learn R software packages to analyze data

take into serious consideration scalable, parallel algorithms

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 8 / 26

Page 9: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Terminologies

Statistics Machine Learning

X predictor, covariate input, feature

Y response, outcome output

{(Xi ,Yi )}ni=1 data points, samples examples, training data

f̂ (X) model fitting machine learning

data analysis consistency, inference prediction, speedgoals confidence interval risk bound

convergence rate learning theory

Practical vs Reluctant to use methods Willing to use ad hocTheoretical methods without theoretical methods if they seemsConcerns justification (even if the to work well (though

justification is appearances may beactually meaningless) misleading)

Computing more and more important heavy

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 9 / 26

Page 10: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Different Learning Problems

Typically, we collect data (Xi ,Yi ), i = 1, · · · , n.

Yi is the outcome or response variable.

Xi is the input or prediction variables.

Various Machine Learning Problems

Supervised learning (Y observed)

Unsupervised learning (Y unobserved)

Semi-supervised learning (Y partially available)

Various Statistical problems

Regression (Y quantitative, a numerical quantity)

Classification (Y qualitative; a class label)

Density estimation (no Y )

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 10 / 26

Page 11: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Supervised Learning vs Unsupervised Learning

Supervised Learning Unsupervised Learning

Response Y observed unobserved

Major predict Y , given find interestingGoal the observed input X patterns in data

Examples linear regression clusteringnonparametric regression density estimation

classification dimension reduction

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 11 / 26

Page 12: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 12 / 26

Page 13: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Which Type of Learning Problems

Predict whether a patient, hospitalized due to a heart attack,will have a second heart attack, based on demographic, dietand clinical measurements

Predict a stock price based on company performance andeconomic data

Identify a handwritten ZIP code, from a digitized image

Classify a recorded phoneme based on a log-periodogram

Identify cancer risk factors based on clinical, genetic anddemographic variables

Identify groups of genes with similar functions from DNAmicroarray data

Establish the relationship between salary and demographicvariables in population survey.

In fraud detection, predict the probability of a fraud order

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 13 / 26

Page 14: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Example 1: Email or Spam (Textbook Page 2)

Training data: 4, 601 email messages with known email type.

Input X: the relative frequencies of 57 of the most commonlyoccurring words and punctuation marks in the email message.Outcome Y : −1 =email , + =spam

Goal: Design an automatic spam detector to predict whethera message is email or spam. In the future, the detector can beused to filter out the junk emails before clogging the users’mailbox.

Supervised learning, binary classification, n > p. (data matrixn × p “long and narrow”)

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 14 / 26

Page 15: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Table: Average percentage of words (the largest difference between spam and email

george you your hp free hpl ! our re edu removespam 0.00 2.26 1.38 0.02 0.52 0.01 0.51 0.51 0.13 0.01 0.28email 1.27 1.27 0.44 0.90 0.07 0.43 0.11 0.18 0.42 0.29 0.01

Possible classification rules:

if (george < 0.6) & (you > 1.5), then spam; otherwise email.

if (0.2 you -0.3 george)> 0, then spam; otherwise email.

Two types of decision errors:

(1) false positive: classify email to spam (filter out email)

(2) false negative: classify spam to email (email box jammed)

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 15 / 26

Page 16: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Example 2: Prostate Cancer

Read Textbook page 3

Training data: 97 male patient with different stages ofprostate cancer.

Input X: eight clinical measures: log-cancer volume (lcavol),log prostate weight (lweight), age, and the other five.

Outcome Y : the log of the level of prostate specific antigen(lpsa)

Questions of interest:

What is the relationship between lpsa and clinical measures?

Is the linear model sufficient? Nonlinear effect? Interactions?

Which clinical measures are more relevant to the prediction?

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 16 / 26

Page 17: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamplesElements of Statisti al Learning Hastie, Tibshirani & Friedman 2001 Chapter 1

lpsa

-1 1 3

oooooo ooo ooo ooooo oo o o oo oo ooo o ooo ooo ooo ooo oo ooo oo ooo oo o ooooooo o oooo ooo o ooo o ooooooo oooo ooo oo ooo

ooo o

o oo ooooooooooo oo ooo ooooooo oooo o oooo ooo ooo ooooo oooo ooo oo oo o oo ooo ooooooooo oo oo ooooo o oooooo oo ooo

oooo40 60 80

o o oooo ooo oooo oo o ooo oooo o oooooooo oo oo oo oo o ooo oo ooo oooo oo ooo oo oo oooo o oooo ooo oo o ooooo ooo oooo ooo oo o oo

oooooo o oooo ooooo ooo oo oo o ooo ooo ooooo ooo oooo oo oo ooo oo oooo oo oo o ooo oo o o oo oo oo oo o oooo ooo o oo ooo oo oooo oo

0.0 0.4 0.8

oooooooooooooooooooooooooooooooooooooo oooooooo ooooooooooooooo oo ooooooo oo oooooo oooo ooo oo oooo oo

oooo

oooooooooooo oo oo o oooo oo ooo oo o ooooo oo oo oooo o ooo ooo ooo o ooo ooooo oo oo o ooo o oo o o o ooo ooo oo oo oo o oooo o ooo o

6.0 7.5 9.0

oo oooooooooo oooo ooooo oo ooo ooooooooo o oooo ooo ooo oooo oo oooooo ooo o ooo oooo oooo oooooooooo oooo oooo oo

oooo

0123

45

oo oooooooooo oooo ooooo oo ooo oo oooooooooo oo oooo oo oooo oo oooooo o oo o ooo o oooo ooo oo ooo ooo oo oo oo o o oo o ooo oo

-10

12

34

oooo

o

o

oo

ooo

o

oooo

o

o

oooo

o

o

oooo

o

o

oo

o

oo

ooo

o

oooo

oooo

ooooo

o

oo

oooooo

ooooooo

o

ooooo

oo

oooo

oooo

o

o

oo

o

o

ooo

oooo

lcavol

o oo

o

o

o

oo

ooo

o

oo oo

o

o

oo

oo

o

o

oo

oo

o

o

o o

o

oo

ooo

o

oooo

oooo

ooo o

o

o

oo

ooo oo

o

oo

ooooo

o

oo

o oo

oo

oooo

oo oo

o

o

oo

o

o

oooo

ooo

o oo

o

o

o

oo

ooo

o

o oo o

o

o

oo

oo

o

o

oo

oo

o

o

oo

o

o o

o oo

o

o oo

o

o ooo

oo

ooo

o

oo

ooo o

oo

oo

ooo

o o

o

oo

ooo

oo

oooo

oo oo

o

o

oo

o

o

oo o

oo o

o

oooo

o

o

o o

ooo

o

oooo

o

o

oo

oo

o

o

oo

oo

o

o

oo

o

oo

ooo

o

ooo

o

o ooo

oo

ooo

o

oo

ooooo

o

oo

ooo

o o

o

oo

ooo

o o

oooo

ooo o

o

o

oo

o

o

ooo

oo o

o

oooo

o

o

oo

ooo

o

oooo

o

o

oooo

o

o

oooo

o

o

oo

o

oo

ooo

o

oooo

oooo

ooooo

o

oo

oooooo

oo

ooooo

o

oo

o oo

oo

oo oo

oo oo

o

o

o o

o

o

ooooooo

oooo

o

o

oo

ooo

o

oo oo

o

o

ooo

o

o

o

oo

oo

o

o

oo

o

o o

o oo

o

ooo

o

oooo

oo

ooo

o

oo

o ooooo

oo

ooo

oo

o

oo

o oo

o o

oo oo

ooo o

o

o

o o

o

o

oo o

ooo

o

ooo

o

o

o

oo

ooo

o

oooo

o

o

ooo

o

o

o

oo

oo

o

o

oo

o

oo

o oo

o

o oo

o

oooo

ooo oo

o

oo

ooo o

oo

oo

ooooo

o

oooo

o

oo

oooo

ooo o

o

o

o o

o

o

ooooooo

ooo

o

o

o

oo

ooo

o

oooo

o

o

ooo

o

o

o

oo

oo

o

o

oo

o

oo

ooo

o

o oo

o

o ooo

ooo oo

o

oo

ooo o

oo

oo

ooo

oo

o

oo

ooo

o o

oo oo

ooo o

o

o

o o

o

o

oo o

oo o

o

oooooooooo

ooooooooooooooooooooo

o

ooo

oo

o

o

ooooooooooooooooo

o

oooo

ooooooooo

oooooo

oooooooooooo

o

oooooo

oo

oo

oo oo ooo o

oooo

oo

o oo

oo oo ooo

ooo o

o

o

ooo

oo

o

o

oooo ooo

oooo

o oo

oo

o

o

oo

oo

o oooo oo

oo

ooo

ooo

oooo

ooooo oo

o

o

oo

oo ooo o lweight

oo

oooo ooo o

ooo o

oo

ooo

oooo o o

ooooo

o

o

oo o

oo

o

o

o oooo

oooo

o ooo

oo

oo

o

oo

oo

o oooo o ooo

o oo

ooo

o ooo

oo

ooo ooo

o

oo

o ooo

oo

oooooo o oooo ooooo

ooo

oo oo o o

oo o

ooo

o

ooo

oo

o

o

ooo oo

oooo

o oo o

oo

oo

o

oo

oo

oooo o o o

oo

o oo

oo

o

oooo

oo

o o oo oo

o

oo

oooo

oo

ooooooooooooooooooooooooooooooo

o

ooo

oo

o

o

ooooooo

oooooooooo

o

oooo

oooooo

ooo

oooooo

oooo

oo

ooo ooo

o

oo

oooooo

oooooooooooo

ooo

oo o

ooo oo ooo

ooo o

o

o

oo o

oo

o

o

ooo o o

oooo

o ooo

oo

oo

o

oooo

oooo o oo

oo

ooo

oo

o

oooo

oo

o oo ooo

o

ooo o o

oo o

oo

oooooooooo

ooo

ooo

ooo oo ooo

oooo

o

o

ooo

oo

o

o

o ooo ooo

oooo

oooooo

o

ooo

o

o ooo ooo

oo

ooo

ooo

ooooooo ooooo

o

oo

oooooo

34

56

oo

oooooooooo

ooo

ooo

ooo oo ooo

oooo

o

o

ooo

oo

o

o

o oooo

oooo

oooo

oo

oo

o

oo

oo

o ooo o oo

oo

ooo

oo

o

oooo

oo

o oo ooo

o

oo

o ooo

oo40

5060

7080

o

o

o

oo

o

oo

o

ooooo

o

ooo

o

o

ooooooooooooo

o

oo

o

oo

oo

ooooo

o

o

o

ooooo

oo

oo

o

o

o

o

oooooooo

o

o

o

o

oooo

ooo

o

o

oooooo

o

ooo

oo

oo

o

o

o

oo

o

oo

o

ooo oo

o

oo

o

o

o

o ooo

o ooo o oooo

o

o o

o

oo

oo

oo o

oo

o

o

o

oooo

o

oo

oo

o

o

o

o

ooooooo o

o

o

o

o

ooo

o

ooo

o

o

oo

oo

o o

o

ooo

oo

o o

o

o

o

oo

o

oo

o

ooooo

o

ooo

o

o

oooooo oooo o o

o

o

o o

o

oo

oo

ooo

oo

o

o

o

o oooo

oo

o o

o

o

o

o

oooooooo

o

o

o

o

oooo

ooo

o

o

oo

oo

o o

o

ooo

oo

oo

ageo

o

o

oo

o

oo

o

oo ooo

o

oo

o

o

o

o ooo

ooo ooo ooo

o

o o

o

oo

oo

oooo

o

o

o

o

oooo

o

oo

o o

o

o

o

o

ooo o

ooo o

o

o

o

o

o ooo

oo o

o

o

oo

oo

oo

o

oo

o

oo

oo

o

o

o

oo

o

oo

o

ooooo

o

ooo

o

o

ooooooooooooo

o

oo

o

oo

oo

ooooo

o

o

o

ooooo

oo

oo

o

o

o

o

oooo

oooo

o

o

o

o

oooo

ooo

o

o

oo

oo

oo

o

ooo

oo

oo

o

o

o

oo

o

oo

o

ooo oo

o

oo

o

o

o

o ooo

oo oo o oooo

o

oo

o

oo

oo

oo oo

o

o

o

o

ooo o

o

oo

oo

o

o

o

o

ooo o

ooo o

o

o

o

o

o oo

o

ooo

o

o

oo

oo

o o

o

oo

o

oo

o o

o

o

o

oo

o

oo

o

ooo oo

o

oo

o

o

o

o ooo

oo ooooooo

o

o o

o

oo

oo

oo o

oo

o

o

o

o oo o

o

oo

oo

o

o

o

o

ooo oooo o

o

o

o

o

oooo

ooo

o

o

ooo

ooo

o

ooo

oo

oo

o

o

o

oo

o

oo

o

ooo oo

o

oo

o

o

o

o ooo

oo oo ooooo

o

oo

o

oo

oo

ooo

oo

o

o

o

o oo o

o

oo

oo

o

o

o

o

ooo o

oooo

o

o

o

o

o oo

o

ooo

o

o

oo

oo

o o

o

oo

o

oo

oo

oooooo

o

o

ooo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

oo

o

oo

o

o

oo

o

o

o

ooo

o

o

oooo oo

o

o

o oo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

o o

o

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

o o

o

oo

o

o

o o

o

o

o

o oo

o

o

o oo ooo

o

o

ooo

o

oo oo

o

oo

o

o

o

o

o

o

o

o

o

o

o

o oo

oo

o

o

o

oo

o o

o

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

o o

o

oo

o

o

o o

o

o

o

ooo

o

o

o o oooo

o

o

o oo

o

o oo o

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

o o

o

o

o

oo

o o

o

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o

o

o

oo

o

o

o

o

o

o

o

o

o

o o

oo

oo

o

oo

o

o

oo

o

o

o

oo o

o

o lbph

oooooo

o

o

ooo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

oo

o

oo

o

o

oo

o

o

o

ooo

o

o

oooooo

o

o

ooo

o

oo oo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

o o

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

oo o

o

o

o

o

o

o

oo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

oo

o o

o

oo

o

o

o o

o

o

o

o oo

o

o

oo oooo

o

o

ooo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

o

o

oo

oo

oo

o

oo

o

o

oo

o

o

o

ooo

o

o

-10

12

oo oooo

o

o

ooo

o

oooo

o

oo

o

o

o

o

o

o

o

o

o

o

o

ooo

oo

o

o

o

oo

oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

o

o

o

oo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

oo

o o

o

oo

o

o

o o

o

o

o

ooo

o

o

0.00.4

0.8

oooooooooooooooooooooooooooooooooooooo

o

ooooooo

o

oooooooooooooo

o

o

o

oooooo

o

o

oooo

oo

o

ooo

o

oo

o

o

ooo

o

oooooo

oooo oo ooo ooo ooooo oo o o oo oo ooo o ooo ooo ooo

o

oo oo ooo

o

o ooo oo o ooooooo

o

o

o

oo ooo o

o

o

o o oo

oo

o

oo o

o

oo

o

o

o oo

o

oo ooo o

o oo ooooooooooo oo ooo ooooooo oooo o oooo ooo

o

oo ooooo

o

ooo ooo oo oo o oo o

o

o

o

oooooo

o

o

oo oo

oo

o

oo o

o

oo

o

o

o oo

o

oooooo

o o oooo ooo oooo oo o ooo oooo o oooooooo oo oo oo

o

o o ooo oo

o

oo oooo oo ooo oo o

o

o

o

oo o ooo

o

o

oo oo

o o

o

ooo

o

oo

o

o

oo o

o

o oo o oo

oooooo o oooo ooooo ooo oo oo o ooo ooo ooooo ooo

o

ooo oo oo

o

oo oo oooo oo oo o o

o

o

o

o o o oo o

o

o

o oo o

oo

o

o oo

o

o o

o

o

oo o

o

oooo oo

svi

oooooooooooo oo oo o oooo oo ooo oo o ooooo oo oo

o

ooo o ooo

o

oo ooo o ooo ooooo

o

o

o

o o ooo o

o

o

o o o o

oo

o

oo o

o

oo

o

o

o oo

o

o o ooo o

oo oooooooooo oooo ooooo oo ooo ooooooooo o oo

o

o ooo ooo

o

ooo oo oooooo ooo

o

o

o

o oooo o

o

o

o ooo

oo

o

ooo

o

oo

o

o

ooo

o

oooooo

oo oooooooooo oooo ooooo oo ooo oo oooooooooo

o

o oooo oo

o

ooo oo oooooo o oo

o

o

o

o o oooo

o

o

o oo o

oo

o

oo o

o

oo

o

o

o o o

o

o ooo oo

oooooooooooooo

o

oo

o

ooo

o

o

o

oooo

o

o

oooooo

o

o

o

o

oo

o

o

o

o

o

o

oooo

o

o

oo

o

oooo

ooo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

ooo

o

o

oooo oo ooo ooooo

o

oo

o

o o o

o

o

o

o oo

o

o

o

oo ooo

o

o

o

o

o

o o

o

o

o

o

o

o

ooo o

o

o

oo

o

oooo

o oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

ooo

o

o

o oo ooooooooooo

o

oo

o

o oo

o

o

o

oooo

o

o

oooo

oo

o

o

o

o

o o

o

o

o

o

o

o

oo

oo

o

o

o o

o

o oo o

ooo

o

o

o

o

oo

o

o

o

oo

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

ooo

o

o

o o oooo ooo oooo

o

o

oo

o

o oo

o

o

o

oooo

o

o

oo oo

oo

o

o

o

o

o o

o

o

o

o

o

o

oo

oo

o

o

o o

o

o oo o

o oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

oo o

o

o

oooooo o oooo ooo

o

oo

o

o oo

o

o

o

ooo

o

o

o

oooo

oo

o

o

o

o

oo

o

o

o

o

o

o

oo

o o

o

o

o o

o

oo o o

ooo

o

o

o

o

oo

o

o

o

oo

o

o

o

o

o o

o

o

o

o

o

o

o

o

o

o o

ooo

o

o

oooooooooooooo

o

oo

o

ooo

o

o

o

oooo

o

o

oooooo

o

o

o

o

oo

o

o

o

o

o

o

oooo

o

o

oo

o

oooo

ooo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

o o

ooo

o

o

lcp

oo oooooooooooo

o

oo

o

ooo

o

o

o

oooo

o

o

ooooo

o

o

o

o

o

oo

o

o

o

o

o

o

oo

oo

o

o

oo

o

o ooo

o oo

o

o

o

o

oo

o

o

o

ooo

o

o

o

oo

o

o

o

o

o

o

o

o

o

o o

ooo

o

o

-10

12

3

oo ooooooooooo

o

o

oo

o

ooo

o

o

o

ooo

o

o

o

oooooo

o

o

o

o

oo

o

o

o

o

o

o

oo

oo

o

o

oo

o

o o oo

o oo

o

o

o

o

oo

o

o

o

oo

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o o

ooo

o

o

6.07.0

8.09.0

oo

o

ooooooooo

ooo

o

o

oooo

o

o

o

oo

ooo

oooooo

o

o

ooo

o

o

o

ooo

o

o

oo

o

o

ooooo

o

oo

o

o

o

o

o

ooo

o

oooo

o

ooooooooo

o

oo

o

ooo

o

oooooo

oo

o

o oo ooo ooo

ooo

o

o

oo o o

o

o

o

o o

oo o

ooo ooo

o

o

o oo

o

o

o

ooo

o

o

oo

o

o

o o ooo

o

oo

o

o

o

o

o

o oo

o

o ooo

o

ooooooo oo

o

o o

o

o oo

o

oo ooo o

o o

o

ooooooooo

oo o

o

o

oo oo

o

o

o

oo

ooo

o o oooo

o

o

o oo

o

o

o

ooo

o

o

oo

o

o

o oo oo

o

oo

o

o

o

o

o

ooo

o

ooo o

o

oo ooooo o o

o

oo

o

o oo

o

oooooo

o o

o

ooo ooo ooo

o oo

o

o

oo oo

o

o

o

oo

ooo

ooo oo o

o

o

o oo

o

o

o

o oo

o

o

o o

o

o

o oo oo

o

oo

o

o

o

o

o

o o o

o

oo oo

o

oo o ooooo o

o

o o

o

oo o

o

o oo o oo

oo

o

ooo o oooo o

ooo

o

o

oo oo

o

o

o

oo

o oo

o ooooo

o

o

o oo

o

o

o

o oo

o

o

o o

o

o

ooo oo

o

o o

o

o

o

o

o

o o o

o

oo oo

o

o o oooo ooo

o

oo

o

oo o

o

oooo oo

oo

o

ooooooooo

ooo

o

o

oooo

o

o

o

oo

ooo

oooooo

o

o

o oo

o

o

o

ooo

o

o

oo

o

o

ooooo

o

oo

o

o

o

o

o

ooo

o

o oo o

o

oooo oooo o

o

o o

o

ooo

o

oooooo

oo

o

ooooooooo

oo o

o

o

oooo

o

o

o

oo

oo o

ooooo o

o

o

o oo

o

o

o

ooo

o

o

o o

o

o

o ooo o

o

oo

o

o

o

o

o

o oo

o

o oo o

o

o ooo ooo oo

o

o o

o

o oo

o

o o ooo ogleason

oo

o

ooooooooo

ooo

o

o

oooo

o

o

o

oo

oo o

oooooo

o

o

o oo

o

o

o

o oo

o

o

oo

o

o

ooooo

o

o o

o

o

o

o

o

o oo

o

o ooo

o

o ooo ooo oo

o

o o

o

o o o

o

o ooo oo

0 2 4oooooooooooo

o

ooo

o

oooooo

o

oo

o

o

o

oooooooooo

o

o

ooooo

o

o

oo

o

o

o

o

oooooo

o

o

o

o

oo

o

ooo

o

oo

o

o

oo

o

oooo

o

o

o

o

o

o

oo

oo

oo

o

o

oooo

o oo ooo ooo

o

ooo

o

oo o oo

o

o

o o

o

o

o

ooo ooo ooo

o

o

o

oo o

oo

o

o

oo

o

o

o

o

oo

ooo

o

o

o

o

o

oo

o

oo o

o

oo

o

o

oo

o

oo

oo

o

o

o

o

o

o

oo

oo

oo

o

o

o

3 4 5 6o oo

ooooooooo

o

o oo

o

oo oooo

o

oo

o

o

o

o o oooo ooo

o

o

o

ooo

oo

o

o

oo

o

o

o

o

oo

o oo

o

o

o

o

o

oo

o

ooo

o

o o

o

o

oo

o

oo

oo

o

o

o

o

o

o

oo

oo

oo

o

o

oo o

oooo ooo ooo

o

oo o

o

oo oooo

o

oo

o

o

o

ooo oo oooo

o

o

o

ooo

oo

o

o

o o

o

o

o

o

oo

ooo

o

o

o

o

o

oo

o

ooo

o

oo

o

o

oo

o

ooo

o

o

o

o

o

o

o

oo

oo

oo

o

o

o

-1 0 1 2oooooo o oooo o

o

ooo

o

oo ooo

o

o

oo

o

o

o

o ooooo ooo

o

o

o

ooooo

o

o

o o

o

o

o

o

oo

o oo

o

o

o

o

o

oo

o

oo o

o

oo

o

o

oo

o

oo

oo

o

o

o

o

o

o

oo

oo

oo

o

o

ooooooooooooo

o

ooo

o

oooooo

o

oo

o

o

o

ooooooooo

o

o

o

ooooo

o

o

oo

o

o

o

o

oooooo

o

o

o

o

oo

o

ooo

o

o o

o

o

oo

o

oooo

o

o

o

o

o

o

oo

oo

oo

o

o

o

-1 1 2 3oooooooooooo

o

o oo

o

ooooo

o

o

oo

o

o

o

ooooo ooooo

o

o

oo ooo

o

o

o o

o

o

o

o

oo

oooo

o

o

o

o

oo

o

oo o

o

o o

o

o

oo

o

ooo

o

o

o

o

o

o

o

oo

oo

oo

o

o

ooo

oooooooooo

o

ooo

o

ooooo

o

o

oo

o

o

o

oooooo ooo

o

o

o

oo o

oo

o

o

oo

o

o

o

o

oooooo

o

o

o

o

oo

o

oo o

o

oo

o

o

oo

o

oooo

o

o

o

o

o

o

oo

oo

oo

o

o

o

0 40 80

020

60100

pgg45

Figure 1.1: S atterplot matrix of the prostate an erdata. The �rst row shows the response against ea h ofthe predi tors in turn. Two of the predi tors, svi andgleason, are ategori al.

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 17 / 26

Page 18: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Elements of Statisti al Learning Hastie, Tibshirani & Friedman 2001 Chapter 3

Shrinkage Factor s

Coeffi

cients

0.0 0.2 0.4 0.6 0.8 1.0

-0.20.0

0.20.4

0.6

••

•• • • • • • • • • • • • • • • • lcavol

• • • • ••

••

•• • • • • • • • • • • • • • • • lweight

• • • • • • • • • • • • • ••

• • • • • • • • • •age

• • • • • • • • • ••

••

•• • • • • • • • • • • lbph

• • • • • • ••

••

••

•• • • • • • • • • • • •svi

• • • • • • • • • • • • • • ••

••

••

••

••

• lcp

• • • • • • • • • • • • • • • • • • • • • • • • •gleason• • • • • • • • • •

••

•• • • • • • • • • •

••pgg45

Figure 3.9: Pro�les of lasso oeÆ ients, as tuningparameter t is varied. CoeÆ ients are plotted versuss = t=Pp1 j�̂j j. A verti al line is drawn at s = 0:5, thevalue hosen by ross-validation. Compare Figure 3.7on page 7; the lasso pro�les hit zero, while those forridge do not.

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 17 / 26

Page 19: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Handwritten Digit Recognition

Handwritten Digit Recognition (Textbook page 4)

Goal: identify single digits 0 ∼ 9 based on the images.

Raw Data: images that are scaled segments from five digitZIP codes.

Each digit has an image of 16× 16 eight-bit gray scale mapsPixel intensities range from 0 (black) to 255 (white)Images are normalized to have approximately the same sizeand orientation

Input X is a 16× 16 matrix, or a 256 dimensional vector.

Output G ∈ G = {0, 1, ..., 9}.The error rate should be very low to avoid misdirection of mail.Some objects are assigned to “do not know” category and sortedby hand.

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 18 / 26

Page 20: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 19 / 26

Page 21: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Example 4: DNA Expression Microarray

DNA = Deoxy-ribo-nucleic acid, a basic material making uphuman chromosomes.DNA Microarray Technique: for each sample from a tissue, theexpression level (the amount of mRNA) of thousands of genes aremeasured.

Training data: p = 6, 830 genes (rows), n = 64 samples(columns) (cancer tumors) taken from two classes.

Input X: the level of expression for each geneGoal: discover the relationship between gene and cancer type,or find the gene signature of each cancer subtype

Challenge: p >> n (data matrix n × p “short and fat”’)

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 20 / 26

Page 22: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

How DNA Microarray Technique Works

A breakthrough technology in biology, facilitating the quantitativestudy of thousands of genes simultaneously from a single sample ofcells

1 The nucleotide sequences for a few thousand genes are printedon a glass slide;

2 A target sample and a reference sample are labeled with redand green dyes, and each are hybridized with the DNA on theslide;

3 Through fluroscopy, the log (red/green) intensities of RNAhybridizing each site is measured;

4 The results are a few thousand numbers, typically rangingfrom -6 to 6, measuring the expression level of each gene inthe target relative to the reference sample (positive valuesindicate higher expression in the target versus the reference,and vice versa for negative values).

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 21 / 26

Page 23: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamplesElements of Statisti al Learning Hastie, Tibshirani & Friedman 2001 Chapter 1

SID42354SID31984SID301902SIDW128368SID375990SID360097SIDW325120ESTsChr.10SIDW365099SID377133SID381508SIDW308182SID380265SIDW321925ESTsChr.15SIDW362471SIDW417270SIDW298052SID381079SIDW428642TUPLE1TUP1ERLUMENSIDW416621SID43609ESTsSID52979SIDW357197SIDW366311ESTsSMALLNUCSIDW486740ESTsSID297905SID485148SID284853ESTsChr.15SID200394SIDW322806ESTsChr.2SIDW257915SID46536SIDW488221ESTsChr.5SID280066SIDW376394ESTsChr.15SIDW321854WASWiskottHYPOTHETICALSIDW376776SIDW205716SID239012SIDW203464HLACLASSISIDW510534SIDW279664SIDW201620SID297117SID377419SID114241ESTsCh31SIDW376928SIDW310141SIDW298203PTPRCSID289414SID127504ESTsChr.3SID305167SID488017SIDW296310ESTsChr.6SID47116MITOCHONDRIAL60ChrSIDW376586HomosapiensSIDW487261SIDW470459SID167117SIDW31489SID375812DNAPOLYMERSID377451ESTsChr.1MYBPROTOSID471915ESTsSIDW469884HumanmRNASIDW377402ESTsSID207172RASGTPASESID325394H.sapiensmRNAGNALSID73161SIDW380102SIDW299104

BREA

STRE

NAL

MELA

NOMA

MELA

NOMA

MCF7

D-rep

roCO

LON

COLO

NK5

62B-r

epro

COLO

NNS

CLC

LEUK

EMIA

RENA

LME

LANO

MABR

EAST CN

SCN

SRE

NAL

MCF7

A-rep

roNS

CLC

K562

A-rep

roCO

LON

CNS

NSCL

CNS

CLC

LEUK

EMIA

CNS

OVAR

IANBR

EAST

LEUK

EMIA

MELA

NOMA

MELA

NOMA

OVAR

IANOV

ARIAN

NSCL

CRE

NAL

BREA

STME

LANO

MAOV

ARIAN

OVAR

IANNS

CLC

RENA

LBR

EAST

MELA

NOMA

LEUK

EMIA

COLO

NBR

EAST

LEUK

EMIA

COLO

NCN

SME

LANO

MANS

CLC

PROS

TATE

NSCL

CRE

NAL

RENA

LNS

CLC

RENA

LLE

UKEM

IAOV

ARIAN

PROS

TATE

COLO

NBR

EAST

RENA

LUN

KNOW

N

Figure 1.3: DNA mi roarray data: expression matrix of6830 genes (rows) and 64 samples ( olumns), for the humantumor data. Only a random sample of 100 rows are shown.The display is a heat map, ranging from bright green (nega-tive, under expressed) to bright red (positive, over expressed).Missing values are gray. The rows and olumns are displayedin a randomly hosen order.Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 22 / 26

Page 24: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Microarray Data Analysis

Typical questions of interest:

Which samples are most similar to each other, in terms oftheir expression profiles across genes?

Which genes are most similar to each other, in terms of theirexpression profiles across sample?

Do certain genes show very high (or low) expression forcertain cancer samples?

.......

This task could be viewed as

a supervised or an unsupervised problem

a regression or a classification problem

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 23 / 26

Page 25: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Example: Direct Mailing Prediction

Interested in predicting how much money an individual willdonate?

Data: observations from 90,000 people on which we haverecorded over 400 different characteristics.

Decision: For a given individual should I send out a mailing?

Do not care too much about each individual characteristic.

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 24 / 26

Page 26: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Statistical Problems in Data Mining

Regression (linear, nonlinear, logistic regression, GLM,graphical model)

Classification (Discriminant Analysis; LDA, QDA, SVM, trees,random forest)

Feature Selection (Variable Selection; Sparse Analysis;LASSO, screening)

Dimension Reduction (PCA; ICA; SDR)

Regularization (control model complexity, aim forparsimonious and interpretable solutions)

Clustering Analysis (k-means, association role)

Other Variations

Semi-supervised Learning

Mixture of the above

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 25 / 26

Page 27: 574M: Introduction to Statistical Machine Learninghzhang/math574m/2020F_Lect... · 2020. 9. 3. · Introduction ExamplesStatistical Machine Learning (ML) Plays a central role in data

IntroductionExamples

Next Class

Introduction to R and RStudio

Please install RStudio on your laptop or PC

Please read one tutorial about R (available at our course page)

Statistical Decision Theory

Hao Helen Zhang 574M: Introduction to Statistical Machine Learning 26 / 26