
Page 1

Francesca Odone and Lorenzo Rosasco

RegML 2013

Regularization Methods for High Dimensional Learning

Genova, June 3-7, 2013

Course organized within the PhD Program in Computer Science for the PhD School in Sciences and Technologies for Information and Knowledge (STIC) and the PhD School in Life and Humanoid Technologies

Page 2

Who are we?

The course is co-organized by:

• SLIPGURU, University of Genova
• Laboratory for Computational and Statistical Learning, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology

Page 3

Schedule

Page 4

Course Schedule and Material

COURSE MATERIAL

• Introductory information at: http://slipguru.disi.unige.it/Teaching/odone_rosasco/
• Slides and additional material on Aulaweb: http://stsi.aulaweb.unige.it/course/view.php?id=59
• disi.unige.it/dottorato/corsi/RegML2013/
• Material also available upon request

Instructors' e-mails: [email protected], [email protected]

Other Sources

• Slipguru: slipguru.disi.unige.it
• LCSL: lcsl.mit.edu
• Regularization Approaches to Learning Theory

Teaching Assistants: Nicoletta Noceti, Silvia Villa, Alessandra Stagliano', Sean Fanello, Gabriele Chiusano, Alessandro Rudi, Luca Zini

Exams? Credits? Certificates? Attendance? Housing?

Page 5

What We Talk About When We Talk About (Machine) Learning

Francesca Odone and Lorenzo Rosasco

RegML 2013


Page 6

Menu’

Appetizer: AI, some context and some history

Entree: Machine Learning at a Glance

Main Course: Intro to Statistical Learning Theory

Page 7

(Artificial) Intelligence

Build intelligent machines

Understand Intelligence

Science and Engineering of Intelligence

Page 8

(Artificial) Intelligence: A Working Definition

Turing test: ingredients for AI

• natural language processing
• knowledge representation
• automated reasoning
• machine learning
• computer vision
• robotics to manipulate

Alan Turing 1912-1954

Page 9

(Artificial) Intelligence & its Neighbors

Neuroscience

Psychology

Cognitive Science

AI

Mathematics

Engineering

Computer Science

• What are the formal rules to draw valid conclusions?
• What can be computed?
• How do we reason with uncertain information?

Philosophy

Page 10

Birth of a Dream

1943: Arturo Rosenblueth, Norbert Wiener and Julian Bigelow coin the term "cybernetics". Wiener's popular book by that name is published in 1948.
1945: Game theory, which would prove invaluable in the progress of AI, is introduced with the 1944 paper Theory of Games and Economic Behavior by mathematician John von Neumann and economist Oskar Morgenstern.
1945: Vannevar Bush publishes As We May Think (The Atlantic Monthly, July 1945), a prescient vision of the future in which computers assist humans in many activities.
1948: John von Neumann (quoted by E.T. Jaynes), in response to a comment at a lecture that it was impossible for a machine to think: "You insist that there is something a machine cannot do. If you will tell me precisely what it is that a machine cannot do, then I can always make a machine which will do just that!". Von Neumann was presumably alluding to the Church-Turing thesis, which states that any effective procedure can be simulated by a (generalized) computer.
1950: Alan Turing proposes the Turing Test as a measure of machine intelligence.
1950: Claude Shannon publishes a detailed analysis of chess playing as search.
1955: The first Dartmouth College summer AI conference is organized by John McCarthy, Marvin Minsky, Nathan Rochester of IBM and Claude Shannon.
1956: The name "artificial intelligence" is used for the first time as the topic of the second Dartmouth Conference, organized by John McCarthy.

...

Page 11

How did it go?

"We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer."

Dartmouth Summer Research Conference on Artificial Intelligence, organized by John McCarthy and proposed by McCarthy, Marvin Minsky, Nathaniel Rochester and Claude Shannon.

Late 1990s: Web crawlers and other AI-based information extraction programs become essential in widespread use of the World Wide Web.
1997: The Deep Blue chess machine (IBM) beats the world chess champion, Garry Kasparov.
2004: DARPA introduces the DARPA Grand Challenge, requiring competitors to produce autonomous vehicles for prize money.

Page 12

10/15 years ago

Page 13

How are we doing now?

Page 14

Pedestrian Detection at Human-Level Performance

Page 15

ML and AI

Machine Learning

systems are trained on examples rather than being programmed

Page 16

Menu’

Appetizer: AI, some context and some history

Entree: Machine Learning at a Glance

Main Course: Intro to Statistical Learning Theory

Page 17

Basic Setting: Classification

Given training data

$$(x_1, y_1), \ldots, (x_n, y_n), \qquad x_i \in \mathbb{R}^p, \; y_i \in Y = \{-1, 1\}, \; i = 1, \ldots, n,$$

collected in the data matrix and label vector

$$X_n = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \dots & x_n^p \end{pmatrix}, \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$
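As a concrete illustration (a sketch added here, not part of the original slides; the sizes and data are invented), the training set can be stored exactly as these two objects:

```python
# Minimal sketch: the data matrix X_n (one row per example, one column per
# feature) and the label vector Y_n with entries in {-1, +1}.
import numpy as np

n, p = 6, 4                        # n examples, p features (arbitrary choices)
rng = np.random.default_rng(0)

Xn = rng.normal(size=(n, p))       # row i is the example x_i in R^p
Yn = rng.choice([-1, 1], size=n)   # y_i in Y = {-1, 1}

print(Xn.shape, Yn)
```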

Page 18

Genomics

n patients, p gene expression measurements each:

$$X_n = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \dots & x_n^p \end{pmatrix}, \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$

Page 19

Text Classification

Page 20

Text Classification: Bag of Words

Each document becomes a row of word counts over a fixed vocabulary of p words:

$$X_n = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \dots & x_n^p \end{pmatrix}$$
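A minimal bag-of-words sketch (illustrative; the toy corpus is invented): each document becomes one row of word counts over a shared vocabulary, yielding exactly the n x p matrix X_n above.

```python
# Build the bag-of-words matrix with the standard library plus numpy.
from collections import Counter
import numpy as np

docs = ["the cat sat on the mat", "the dog sat", "cat and dog"]

vocab = sorted({w for d in docs for w in d.split()})  # the p vocabulary words
index = {w: j for j, w in enumerate(vocab)}

Xn = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        Xn[i, index[w]] = c        # count of word w in document i

print(vocab)
print(Xn)
```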

Page 21

Image Classification

handwriting

Automatic Car Plate Reading

Page 22

Image Classification

Each image is encoded as a feature vector, giving again the data matrix

$$X_n = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \dots & x_n^p \end{pmatrix}$$

Page 23

From classification to regression

$$(x_1, y_1), \ldots, (x_n, y_n)$$

Classification: $x_i \in \mathbb{R}^D$ and $y_i \in Y = \{-1, 1\}$, $i = 1, \ldots, n$.

Regression: $y_i \in Y \subseteq \mathbb{R}$, $i = 1, \ldots, n$.

From Andrew Ng's CS229 lecture notes, "Supervised learning":

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

Living area (feet²)   Price (1000$s)
2104                  400
1600                  330
2400                  369
1416                  232
3000                  540
...                   ...

We can plot this data:

[Scatter plot "housing prices": square feet (500 to 5000) on the x-axis, price in $1000s (0 to 1000) on the y-axis.]

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?


Part I: Linear Regression

To make our housing example more interesting, let's consider a slightly richer dataset in which we also know the number of bedrooms in each house:

Living area (feet²)   #bedrooms   Price (1000$s)
2104                  3           400
1600                  3           330
2400                  3           369
1416                  2           232
3000                  4           540
...                   ...         ...

Here, the x's are two-dimensional vectors in $\mathbb{R}^2$. For instance, $x_1^{(i)}$ is the living area of the i-th house in the training set, and $x_2^{(i)}$ is its number of bedrooms. (In general, when designing a learning problem, it will be up to you to decide what features to choose, so if you are out in Portland gathering housing data, you might also decide to include other features such as whether each house has a fireplace, the number of bathrooms, and so on. We'll say more about feature selection later, but for now let's take the features as given.)

To perform supervised learning, we must decide how we're going to represent functions/hypotheses h in a computer. As an initial choice, let's say we decide to approximate y as a linear function of x:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Here, the $\theta_i$'s are the parameters (also called weights) parameterizing the space of linear functions mapping from X to Y. When there is no risk of confusion, we will drop the $\theta$ subscript in $h_\theta(x)$ and write it more simply as $h(x)$. To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (this is the intercept term), so that

$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x,$$

where on the right-hand side above we are viewing $\theta$ and $x$ both as vectors, and here n is the number of input variables (not counting $x_0$).

In short, a common model for regression is

$$y_i = f(x_i) + \sigma \varepsilon_i, \quad \sigma > 0, \qquad \text{e.g. } f(x) = w^T x, \; \varepsilon_i \sim N(0, 1).$$
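A least-squares sketch on the five housing examples tabulated above (the fitting code and the test point are our own additions): it fits $h(x) = \theta^T x$ with the intercept convention $x_0 = 1$.

```python
# Fit price ~ theta_0 + theta_1 * living_area + theta_2 * bedrooms
# by least squares on the five CS229 examples.
import numpy as np

living = np.array([2104, 1600, 2400, 1416, 3000], dtype=float)
beds = np.array([3, 3, 3, 2, 4], dtype=float)
price = np.array([400, 330, 369, 232, 540], dtype=float)   # in $1000s

X = np.column_stack([np.ones_like(living), living, beds])  # x_0 = 1 intercept
theta, *_ = np.linalg.lstsq(X, price, rcond=None)

print("theta =", theta)
print("prediction for a 1800 ft^2, 3-bedroom house:",
      theta @ np.array([1.0, 1800.0, 3.0]))
```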

Page 24

Batch Learning

$$(x_1, y_1), \ldots, (x_n, y_n)$$

[Diagram: inputs $X$ and outputs $Y$; the sample $S$ of input-output pairs is fed to a learning machine (LM), which returns a function $f$; for a new input $x$ the machine predicts $f(x)$.]

Page 25

Machine Learning: Problems and Approaches

Learning Problems
• Supervised Learning
• Semisupervised
• Online
• ...

Learning Approaches
• Batch Learning
• Online
• Active
• ...

Page 26

Variations on a Theme

$$(x_1, y_1), \ldots, (x_n, y_n)$$

Multiclass: $x_i \in \mathbb{R}^D$ and $y_i \in Y = \{1, \ldots, T\}$, $i = 1, \ldots, n$.

Multitask: $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}^T$, $i = 1, \ldots, n$.

Learning a similarity function:

$$(x_1, x_1; y_{1,1}), (x_1, x_2; y_{1,2}), \ldots, (x_n, x_n; y_{n,n}), \qquad x_j, x_i \in \mathbb{R}^D, \; y_{i,j} \in [0, 1], \; j, i = 1, \ldots, n.$$

Page 27

Semisupervised Learning

A large set of $u$ unlabeled inputs is available in addition to the $n$ labeled pairs:

$$X_u = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_u^1 & \dots & x_u^p \end{pmatrix} \cup X_n = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \dots & x_n^p \end{pmatrix}, \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$

Manifold Learning


Page 33

Online Learning

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$

$$f_0 \to f_1 \to f_2 \to \cdots \to f_n$$
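A minimal online-learning sketch (an assumed example, not the slides' algorithm): a linear estimator updated one example at a time by a stochastic-gradient step on the square loss, producing the sequence $f_0, f_1, \ldots, f_n$.

```python
# Online update: each incoming (x_i, y_i) moves the current weight vector.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])           # ground truth (invented)
w = np.zeros(2)                          # f_0: the initial estimator
eta = 0.1                                # step size (arbitrary choice)

for t in range(200):                     # examples arrive one at a time
    x = rng.normal(size=2)
    y = w_true @ x + 0.1 * rng.normal()  # y_i = f(x_i) + sigma * eps_i
    w += eta * (y - w @ x) * x           # f_t -> f_{t+1}

print("recovered w ~", w)
```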

Page 34

Machine Learning: Problems and Approaches

Learning Problems
• Supervised Learning
• Semisupervised
• Online
• Unsupervised Learning

Learning Approaches
• Batch Learning
• Online
• Active
• ...

Page 35

Unsupervised Learning

Clustering, dimensionality reduction, learning data representations, ...

Goal: extract patterns.

Given only $x_1, \ldots, x_n$: the data matrix $X_n$ is available but the label vector $Y_n$ is not.

$$X_n = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \dots & x_n^p \end{pmatrix}$$
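A minimal unsupervised sketch (illustrative; data and dimensions invented): dimensionality reduction by principal component analysis, computed from the SVD of the centered data matrix $X_n$, with no labels used anywhere.

```python
# PCA: project the unlabeled data onto its top-2 principal directions.
import numpy as np

rng = np.random.default_rng(0)
Xn = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # unlabeled inputs

Xc = Xn - Xn.mean(axis=0)                    # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                            # 2-dimensional representation

print("explained variance ratio:", (s[:2] ** 2 / (s ** 2).sum()).round(3))
print("reduced data shape:", Z.shape)
```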

Page 36

Machine Learning: Problems and Approaches

Learning Problems
• Supervised Learning
• Semisupervised
• Online
• ...

Learning Approaches
• Batch Learning
• Online
• Active
• ...

Page 37

Online/Incremental Learning

$$(x_1, y_1) \to f_1, \qquad (x_1, y_1), (x_2, y_2) \to f_2, \qquad \ldots, \qquad (x_1, y_1), \ldots, (x_n, y_n) \to f_n$$

Starting from $f_0$, the estimator is updated incrementally as each new example arrives.

Page 38

Active Learning

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), \qquad f_0 \to f_1 \to f_2 \to \cdots \to f_n$$

[Excerpt from "Foundations and Applications of Sensor Management", p. 196, Figure 8.7: the two-step procedure for d = 2: (a) initial unpruned RDP and n/2 samples; (b) preview-step RDP, where a pruned cell still contains part of the boundary; (c) additional sampling for the refinement step; (d) refinement step. The final estimator assembles the estimate away from the boundary (preview step) with the estimate in the vicinity of the boundary (refinement step); the faster rates are shown under a technical, not very restrictive "cusp-free" boundary assumption (see [52]).]

Learner can query points
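A minimal query-selection sketch (our own, far simpler than the RDP procedure excerpted above): with a current linear classifier $f(x) = w \cdot x$, the learner queries the label of the unlabeled point it is least certain about, i.e. the one with smallest $|f(x)|$.

```python
# Uncertainty sampling: pick the pool point closest to the decision boundary.
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -0.5])            # current estimator (invented)
pool = rng.normal(size=(50, 2))      # unlabeled pool

scores = np.abs(pool @ w)            # |f(x)| as an uncertainty proxy
query = pool[np.argmin(scores)]      # the point whose label to request next
print("query point:", query)
```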

Page 39

Some Remarks

We look for computer systems that are trained, rather than programmed, to perform a task

Learning from examples is a unifying paradigm in AI: it makes it possible to exploit the availability of data and computational resources.

"Learning is the acquisition of knowledge or skills through study, experience, or being taught."

Page 40

Menu’

Appetizer: AI, some context and some history

Entree: Machine Learning at a Glance

Main Course: Intro to Statistical Learning Theory

Page 41

**Warning** Math

The course contains many ideas and (quite) a bit of math; questions help prevent sleeping...

Page 42

Training Set

Given a training set

$$S = (x_1, y_1), \ldots, (x_n, y_n)$$

find $f$ such that

$$f(x) \sim y.$$

Page 43

Loss function

We need a way to measure errors

Loss function: $V(f(x), y)$

Page 44

Loss function examples

• 0-1 loss: $V(f(x), y) = \theta(-y f(x))$ ($\theta$ is the step function)
• square loss (L2): $V(f(x), y) = (f(x) - y)^2 = (1 - y f(x))^2$
• absolute value (L1): $V(f(x), y) = |f(x) - y|$
• Vapnik's $\epsilon$-insensitive loss: $V(f(x), y) = (|f(x) - y| - \epsilon)_+$
• hinge loss: $V(f(x), y) = (1 - y f(x))_+$
• logistic loss: $V(f(x), y) = \log(1 + e^{-y f(x)})$ (logistic regression)
• exponential loss: $V(f(x), y) = e^{-y f(x)}$
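The losses above written out as code (a sketch; the margin form $y f(x)$ assumes $y \in \{-1, +1\}$, and the test values are invented):

```python
import numpy as np

step = lambda t: (t > 0).astype(float)                 # the step function theta

zero_one    = lambda f, y: step(-y * f)                # 0-1 loss
square      = lambda f, y: (f - y) ** 2                # = (1 - y f)^2 for y = +-1
absolute    = lambda f, y: np.abs(f - y)               # L1 loss
eps_insens  = lambda f, y, eps=0.1: np.maximum(np.abs(f - y) - eps, 0)
hinge       = lambda f, y: np.maximum(1 - y * f, 0)
logistic    = lambda f, y: np.log(1 + np.exp(-y * f))
exponential = lambda f, y: np.exp(-y * f)

f, y = np.array([0.8, -0.3]), np.array([1.0, 1.0])     # predictions vs. labels
for name, V in [("0-1", zero_one), ("square", square), ("hinge", hinge),
                ("logistic", logistic), ("exponential", exponential)]:
    print(name, V(f, y))
```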

Page 45

Empirical error

Given a loss function $V(f(x), y)$, we can define the Empirical Error

$$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i).$$
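A minimal sketch of the definition (the candidate function and the three training pairs are invented): the empirical error is just the average loss over the sample.

```python
import numpy as np

def empirical_error(V, f, X, Y):
    """I_S[f] = (1/n) * sum_i V(f(x_i), y_i)."""
    return np.mean([V(f(x), y) for x, y in zip(X, Y)])

w = np.array([1.0, -1.0])
f = lambda x: w @ x                                  # a candidate function
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([1.0, -1.0, 1.0])
hinge = lambda fx, y: max(1 - y * fx, 0.0)

print("I_S[f] =", empirical_error(hinge, f, X, Y))   # (0 + 0 + 1)/3
```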

Page 46

Hypotheses Space

``Learning processes do not take place in vacuum.''
Cucker and Smale, AMS 2001

We need to fix a Hypotheses Space

$$\mathcal{H} \subset \mathcal{F} = \{f \mid f : X \to Y\}$$

[Figure: the hypotheses space H drawn as a subset of the space F of all functions.]

Page 47

Hypotheses Space

$$\mathcal{H} \subset \mathcal{F} = \{f \mid f : X \to Y\}$$

• Linear model: $f(x) = \sum_{j=1}^{p} x^j w^j$

• Generalized linear models: $f(x) = \sum_{j=1}^{p} \phi(x)^j w^j$

• Reproducing kernel Hilbert spaces: $f(x) = \sum_{j \ge 1} \phi(x)^j w^j = \sum_{i \ge 1} K(x, x_i) \alpha_i$

$K(x, x')$ is a symmetric positive definite function called the reproducing kernel.

These choices range from parametric (the linear model), through semi-parametric, to non-parametric (reproducing kernel Hilbert spaces).
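A minimal RKHS sketch (an assumed example: Gaussian kernel, invented centers and coefficients): a function of the kernel-expansion form $f(x) = \sum_i K(x, x_i)\, \alpha_i$.

```python
import numpy as np

def K(x, xp, sigma=1.0):
    """Gaussian kernel: symmetric and positive definite."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

centers = np.array([[0.0], [1.0], [2.0]])   # the points x_i in the expansion
alpha = np.array([1.0, -0.5, 0.3])          # expansion coefficients

f = lambda x: sum(a * K(x, xi) for a, xi in zip(alpha, centers))
print(f(np.array([0.5])))
```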


Page 50

Minimizing the empirical error

Empirical Risk Minimization (ERM)

$$\min_{f \in \mathcal{H}} I_S[f] = \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)$$


Page 52

Minimizing the empirical error

Empirical Risk Minimization (ERM)

$$\min_{f \in \mathcal{H}} I_S[f] = \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)$$

Which one is a good solution?
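A sketch of why the question matters (our own example; data, degrees and noise level invented): two hypothesis spaces, degree-1 and degree-9 polynomials, both fit by minimizing the empirical square loss. The richer space drives the empirical error toward zero yet oscillates between the samples, which is the overfitting the next slides address.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)  # noisy samples

for deg in (1, 9):
    coeffs = np.polyfit(x, y, deg)        # ERM over degree-deg polynomials
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {deg}: empirical square-loss error = {train_err:.4f}")
```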

Page 53

Statistical Learning: Overfitting and Generalization

From Andrew Ng's CS229 lecture notes (Fall 2012):

To establish notation for future use, we'll use x^(i) to denote the "input" variables (living area in this example), also called input features, and y^(i) to denote the "output" or target variable that we are trying to predict (price). A pair (x^(i), y^(i)) is called a training example, and the dataset that we'll be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}, is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y the space of output values. In this example, X = Y = R.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X -> Y so that h(x) is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:

[Diagram: training set -> learning algorithm -> hypothesis h; a new x (living area of a house) is mapped by h to a predicted y (predicted price).]

When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

The training set

$$S = (x_1, y_1), \ldots, (x_n, y_n)$$

is sampled identically and independently (i.i.d.) from a fixed unknown probability distribution $p(x, y) = p(x)\, p(y|x)$.
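A minimal sketch of this sampling assumption (the distributions are invented): draw each pair i.i.d. from $p(x, y) = p(x)\, p(y|x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 3, 0.1
w = np.array([1.0, -2.0, 0.5])               # defines p(y|x) (invented)

X = rng.normal(size=(n, p))                  # x_i drawn i.i.d. from p(x)
Y = X @ w + sigma * rng.normal(size=n)       # y_i drawn from p(y | x_i)

print(X.shape, Y.shape)
```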

Page 54

Generalization and Stability: ERM and Ill-Posedness

Learning is an ill-posed problem (Jacques Hadamard).

Ill-posed problems often arise if one tries to infer general laws from few data:
• the hypothesis space is too large
• there are not enough data

In general ERM leads to ill-posed solutions because
• the solution may be too complex
• it may not be unique
• it may change radically when leaving one sample out

Regularization Theory provides results and techniques to restore well-posedness, that is stability (hence generalization).

Page 55

Theory of Machine Learning

• Beyond drawings & intuitions (...) there is a deep, rigorous mathematical foundation of regularized learning algorithms (Cucker and Smale; Vapnik and Chervonenkis; ...).

• The theory of learning is a synthesis of different fields, e.g. Computer Science (Algorithms, Complexity) and Mathematics (Optimization, Probability, Statistics).

• Central to the theory of machine learning is the problem of understanding the conditions under which ERM can solve

$$\inf_{f} \mathcal{E}(f), \qquad \mathcal{E}(f) = \mathbb{E}_{(x,y)}\big[V(y, f(x))\big]$$

Page 56

(Tikhonov) Regularization

$$\min_{f \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda R(f) \right\} \to f_S^\lambda$$

with regularizer $R$ and regularization parameter $\lambda$.

• The regularizer describes the complexity of the solution.

[Figure: two candidate functions $f_1$ and $f_2$, with $R(f_2)$ bigger than $R(f_1)$.]

• The regularization parameter determines the trade-off between complexity and empirical risk.
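A minimal Tikhonov sketch for the linear case (our own example; data and the grid of lambda values invented): square loss with regularizer $R(f) = \|w\|^2$ gives ridge regression, whose minimizer solves $(X^T X + \lambda n I)\, w = X^T Y$; varying $\lambda$ exposes the complexity/fit trade-off.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

for lam in (1e-6, 1e-2, 1.0):
    # Closed-form Tikhonov minimizer for V = square loss, R(f) = ||w||^2.
    w = np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ Y)
    fit = np.mean((X @ w - Y) ** 2)
    print(f"lambda={lam:g}: empirical error={fit:.4f}, ||w||^2={w @ w:.2f}")
```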

Page 57

Some Remarks and Some Questions

• Supervised learning in statistical learning theory: basic concepts and notation.

• The regularization approach:
  - Which hypotheses space? Which regularizer?
  - How can we find a solution in an efficient way?
  - How do we solve the fitting/regularizing trade-off?