
Page 1

Francesca Odone and Lorenzo Rosasco

RegML 2013

Regularization Methods for High Dimensional Learning

Genova, June 3-7, 2013

Course organized within the PhD Program in Computer Science for the PhD School in Sciences and Technologies for Information and Knowledge (STIC) and the PhD School in Life and Humanoid Technologies

Page 2

Who are we?

The course is co-organized by:

• SLIPGURU, University of Genova
• Laboratory for Computational and Statistical Learning, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology

Page 3

Schedule

Page 4

Course Schedule and Material

COURSE MATERIAL

• Introductory information at: http://slipguru.disi.unige.it/Teaching/odone_rosasco/
• Slides and additional material on Aulaweb: http://stsi.aulaweb.unige.it/course/view.php?id=59
• disi.unige.it/dottorato/corsi/RegML2013/
• Material also available upon request

Instructors' e-mails: [email protected], [email protected]

Other Sources

• Slipguru: slipguru.disi.unige.it
• LCSL: lcsl.mit.edu
• Regularization Approaches to Learning Theory

Teaching Assistants: Nicoletta Noceti, Silvia Villa, Alessandra Stagliano', Sean Fanello, Gabriele Chiusano, Alessandro Rudi, Luca Zini

Exams? Credits? Certificates? Attendance? Housing?

Page 5

What We Talk About When We Talk About (Machine) Learning

Francesca Odone and Lorenzo Rosasco

RegML 2013


Page 6

Menu’

Appetizer: AI, some context and some history

Entree: Machine Learning at a Glance

Main Course: Intro to Statistical Learning Theory

Page 7

(Artificial) Intelligence

Build intelligent machines

Understand Intelligence

Science and Engineering of Intelligence

Page 8

(Artificial) Intelligence: A Working Definition

Turing test: ingredients for AI

• natural language processing
• knowledge representation
• automated reasoning
• machine learning
• computer vision
• robotics to manipulate

Alan Turing 1912-1954

Page 9

(Artificial) Intelligence & its Neighbors

Neuroscience

Psychology

Cognitive Science

AI

Mathematics

Engineering

Computer Science

• What are the formal rules to draw valid conclusions?
• What can be computed?
• How do we reason with uncertain information?

Philosophy

Page 10

Birth of a Dream

1943: Arturo Rosenblueth, Norbert Wiener and Julian Bigelow coin the term "cybernetics". Wiener's popular book by that name is published in 1948.
1945: Game theory, which would prove invaluable in the progress of AI, is introduced with the 1944 paper Theory of Games and Economic Behavior by mathematician John von Neumann and economist Oskar Morgenstern.
1945: Vannevar Bush publishes As We May Think (The Atlantic Monthly, July 1945), a prescient vision of the future in which computers assist humans in many activities.
1948: John von Neumann (quoted by E.T. Jaynes), in response to a comment at a lecture that it was impossible for a machine to think: "You insist that there is something a machine cannot do. If you will tell me precisely what it is that a machine cannot do, then I can always make a machine which will do just that!". Von Neumann was presumably alluding to the Church-Turing thesis, which states that any effective procedure can be simulated by a (generalized) computer.
1950: Alan Turing proposes the Turing Test as a measure of machine intelligence.
1950: Claude Shannon publishes a detailed analysis of chess playing as search.
1955: The first Dartmouth College summer AI conference is organized by John McCarthy, Marvin Minsky, Nathan Rochester of IBM and Claude Shannon.
1956: The name "artificial intelligence" is used for the first time as the topic of the second Dartmouth Conference, organized by John McCarthy.

...

Page 11

How did it go?

"We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer."

Dartmouth Summer Research Conference on Artificial Intelligence, organized by John McCarthy and proposed by McCarthy, Marvin Minsky, Nathaniel Rochester and Claude Shannon.

Late 1990s: Web crawlers and other AI-based information extraction programs become essential in widespread use of the World Wide Web.
1997: The Deep Blue chess machine (IBM) beats the world chess champion, Garry Kasparov.
2004: DARPA introduces the DARPA Grand Challenge, requiring competitors to produce autonomous vehicles for prize money.

Page 12

10/15 years ago

Page 13

How are we doing now?

Page 14

Pedestrian Detection at Human-Level Performance

Page 15

ML and AI

Machine Learning

systems are trained on examples rather than being programmed

Page 16

Menu’

Appetizer: AI, some context and some history

Entree: Machine Learning at a Glance

Main Course: Intro to Statistical Learning Theory

Page 17

Basic Setting: Classification

Given training data

$$(x_1, y_1), \ldots, (x_n, y_n), \qquad x_i \in \mathbb{R}^p, \; y_i \in Y = \{-1, 1\}, \; i = 1, \ldots, n,$$

collected in the data matrix and label vector

$$X_n = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \dots & x_n^p \end{pmatrix}, \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$
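As a concrete illustration (a sketch added here, not part of the original slides; the sizes and data are invented), the training set can be stored exactly as these two objects:

```python
# Minimal sketch: the data matrix X_n (one row per example, one column per
# feature) and the label vector Y_n with entries in {-1, +1}.
import numpy as np

n, p = 6, 4                        # n examples, p features (arbitrary choices)
rng = np.random.default_rng(0)

Xn = rng.normal(size=(n, p))       # row i is the example x_i in R^p
Yn = rng.choice([-1, 1], size=n)   # y_i in Y = {-1, 1}

print(Xn.shape, Yn)
```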

Page 18

Genomics

n patients, p gene expression measurements each:

$$X_n = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \dots & x_n^p \end{pmatrix}, \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$

Page 19

Text Classification

Page 20

Text Classification: Bag of Words

Each document becomes a row of word counts over a fixed vocabulary of p words:

$$X_n = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \dots & x_n^p \end{pmatrix}$$
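A minimal bag-of-words sketch (illustrative; the toy corpus is invented): each document becomes one row of word counts over a shared vocabulary, yielding exactly the n x p matrix X_n above.

```python
# Build the bag-of-words matrix with the standard library plus numpy.
from collections import Counter
import numpy as np

docs = ["the cat sat on the mat", "the dog sat", "cat and dog"]

vocab = sorted({w for d in docs for w in d.split()})  # the p vocabulary words
index = {w: j for j, w in enumerate(vocab)}

Xn = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        Xn[i, index[w]] = c        # count of word w in document i

print(vocab)
print(Xn)
```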

Page 21

Image Classification

handwriting

Automatic Car Plate Reading

Page 22

Image Classification

Each image is encoded as a feature vector, giving again the data matrix

$$X_n = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \dots & x_n^p \end{pmatrix}$$

Page 23

From classification to regression

$$(x_1, y_1), \ldots, (x_n, y_n)$$

Classification: $x_i \in \mathbb{R}^D$ and $y_i \in Y = \{-1, 1\}$, $i = 1, \ldots, n$.

Regression: $y_i \in Y \subseteq \mathbb{R}$, $i = 1, \ldots, n$.

From Andrew Ng's CS229 lecture notes, "Supervised learning":

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

Living area (feet²)   Price (1000$s)
2104                  400
1600                  330
2400                  369
1416                  232
3000                  540
...                   ...

We can plot this data:

[Scatter plot "housing prices": square feet (500 to 5000) on the x-axis, price in $1000s (0 to 1000) on the y-axis.]

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?


Part I: Linear Regression

To make our housing example more interesting, let's consider a slightly richer dataset in which we also know the number of bedrooms in each house:

Living area (feet²)   #bedrooms   Price (1000$s)
2104                  3           400
1600                  3           330
2400                  3           369
1416                  2           232
3000                  4           540
...                   ...         ...

Here, the x's are two-dimensional vectors in $\mathbb{R}^2$. For instance, $x_1^{(i)}$ is the living area of the i-th house in the training set, and $x_2^{(i)}$ is its number of bedrooms. (In general, when designing a learning problem, it will be up to you to decide what features to choose, so if you are out in Portland gathering housing data, you might also decide to include other features such as whether each house has a fireplace, the number of bathrooms, and so on. We'll say more about feature selection later, but for now let's take the features as given.)

To perform supervised learning, we must decide how we're going to represent functions/hypotheses h in a computer. As an initial choice, let's say we decide to approximate y as a linear function of x:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Here, the $\theta_i$'s are the parameters (also called weights) parameterizing the space of linear functions mapping from X to Y. When there is no risk of confusion, we will drop the $\theta$ subscript in $h_\theta(x)$ and write it more simply as $h(x)$. To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (this is the intercept term), so that

$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x,$$

where on the right-hand side above we are viewing $\theta$ and $x$ both as vectors, and here n is the number of input variables (not counting $x_0$).

In short, a common model for regression is

$$y_i = f(x_i) + \sigma \varepsilon_i, \quad \sigma > 0, \qquad \text{e.g. } f(x) = w^T x, \; \varepsilon_i \sim N(0, 1).$$
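A least-squares sketch on the five housing examples tabulated above (the fitting code and the test point are our own additions): it fits $h(x) = \theta^T x$ with the intercept convention $x_0 = 1$.

```python
# Fit price ~ theta_0 + theta_1 * living_area + theta_2 * bedrooms
# by least squares on the five CS229 examples.
import numpy as np

living = np.array([2104, 1600, 2400, 1416, 3000], dtype=float)
beds = np.array([3, 3, 3, 2, 4], dtype=float)
price = np.array([400, 330, 369, 232, 540], dtype=float)   # in $1000s

X = np.column_stack([np.ones_like(living), living, beds])  # x_0 = 1 intercept
theta, *_ = np.linalg.lstsq(X, price, rcond=None)

print("theta =", theta)
print("prediction for a 1800 ft^2, 3-bedroom house:",
      theta @ np.array([1.0, 1800.0, 3.0]))
```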

Page 24

Batch Learning

$$(x_1, y_1), \ldots, (x_n, y_n)$$

[Diagram: inputs $X$ and outputs $Y$; the sample $S$ of input-output pairs is fed to a learning machine (LM), which returns a function $f$; for a new input $x$ the machine predicts $f(x)$.]

Page 25

Machine Learning: Problems and Approaches

Learning Problems
• Supervised Learning
• Semisupervised
• Online
• ...

Learning Approaches
• Batch Learning
• Online
• Active
• ...

Page 26

Variations on a Theme

$$(x_1, y_1), \ldots, (x_n, y_n)$$

Multiclass: $x_i \in \mathbb{R}^D$ and $y_i \in Y = \{1, \ldots, T\}$, $i = 1, \ldots, n$.

Multitask: $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}^T$, $i = 1, \ldots, n$.

Learning a similarity function:

$$(x_1, x_1; y_{1,1}), (x_1, x_2; y_{1,2}), \ldots, (x_n, x_n; y_{n,n}), \qquad x_j, x_i \in \mathbb{R}^D, \; y_{i,j} \in [0, 1], \; j, i = 1, \ldots, n.$$

Page 27

Semisupervised Learning

A large set of $u$ unlabeled inputs is available in addition to the $n$ labeled pairs:

$$X_u = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_u^1 & \dots & x_u^p \end{pmatrix} \cup X_n = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \dots & x_n^p \end{pmatrix}, \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$

Manifold Learning


Page 33

Online Learning

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$

$$f_0 \to f_1 \to f_2 \to \cdots \to f_n$$
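A minimal online-learning sketch (an assumed example, not the slides' algorithm): a linear estimator updated one example at a time by a stochastic-gradient step on the square loss, producing the sequence $f_0, f_1, \ldots, f_n$.

```python
# Online update: each incoming (x_i, y_i) moves the current weight vector.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])           # ground truth (invented)
w = np.zeros(2)                          # f_0: the initial estimator
eta = 0.1                                # step size (arbitrary choice)

for t in range(200):                     # examples arrive one at a time
    x = rng.normal(size=2)
    y = w_true @ x + 0.1 * rng.normal()  # y_i = f(x_i) + sigma * eps_i
    w += eta * (y - w @ x) * x           # f_t -> f_{t+1}

print("recovered w ~", w)
```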

Page 34

Machine Learning: Problems and Approaches

Learning Problems
• Supervised Learning
• Semisupervised
• Online
• Unsupervised Learning

Learning Approaches
• Batch Learning
• Online
• Active
• ...

Page 35

Unsupervised Learning

Clustering, dimensionality reduction, learning data representations, ...

Goal: extract patterns.

Given only $x_1, \ldots, x_n$: the data matrix $X_n$ is available but the label vector $Y_n$ is not.

$$X_n = \begin{pmatrix} x_1^1 & \dots & x_1^p \\ \vdots & & \vdots \\ x_n^1 & \dots & x_n^p \end{pmatrix}$$
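A minimal unsupervised sketch (illustrative; data and dimensions invented): dimensionality reduction by principal component analysis, computed from the SVD of the centered data matrix $X_n$, with no labels used anywhere.

```python
# PCA: project the unlabeled data onto its top-2 principal directions.
import numpy as np

rng = np.random.default_rng(0)
Xn = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # unlabeled inputs

Xc = Xn - Xn.mean(axis=0)                    # center each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                            # 2-dimensional representation

print("explained variance ratio:", (s[:2] ** 2 / (s ** 2).sum()).round(3))
print("reduced data shape:", Z.shape)
```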

Page 36

Machine Learning: Problems and Approaches

Learning Problems
• Supervised Learning
• Semisupervised
• Online
• ...

Learning Approaches
• Batch Learning
• Online
• Active
• ...

Page 37

Online/Incremental Learning

$$(x_1, y_1) \to f_1, \qquad (x_1, y_1), (x_2, y_2) \to f_2, \qquad \ldots, \qquad (x_1, y_1), \ldots, (x_n, y_n) \to f_n$$

Starting from $f_0$, the estimator is updated incrementally as each new example arrives.

Page 38

Active Learning

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n), \qquad f_0 \to f_1 \to f_2 \to \cdots \to f_n$$

[Excerpt from "Foundations and Applications of Sensor Management", p. 196, Figure 8.7: the two-step procedure for d = 2: (a) initial unpruned RDP and n/2 samples; (b) preview-step RDP, where a pruned cell still contains part of the boundary; (c) additional sampling for the refinement step; (d) refinement step. The final estimator assembles the estimate away from the boundary (preview step) with the estimate in the vicinity of the boundary (refinement step); the faster rates are shown under a technical, not very restrictive "cusp-free" boundary assumption (see [52]).]

Learner can query points
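A minimal query-selection sketch (our own, far simpler than the RDP procedure excerpted above): with a current linear classifier $f(x) = w \cdot x$, the learner queries the label of the unlabeled point it is least certain about, i.e. the one with smallest $|f(x)|$.

```python
# Uncertainty sampling: pick the pool point closest to the decision boundary.
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -0.5])            # current estimator (invented)
pool = rng.normal(size=(50, 2))      # unlabeled pool

scores = np.abs(pool @ w)            # |f(x)| as an uncertainty proxy
query = pool[np.argmin(scores)]      # the point whose label to request next
print("query point:", query)
```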

Page 39

Some Remarks

We look for computer systems that are trained, rather than programmed, to perform a task

Learning from examples is a unifying paradigm in AI: it makes it possible to exploit the availability of data and computational resources.

"Learning is the acquisition of knowledge or skills through study, experience, or being taught."

Page 40

Menu’

Appetizer: AI, some context and some history

Entree: Machine Learning at a Glance

Main Course: Intro to Statistical Learning Theory

Page 41

**Warning** Math

The course contains many ideas and (quite) a bit of math; questions help prevent sleeping...

Page 42

Training Set

Given a training set

$$S = (x_1, y_1), \ldots, (x_n, y_n)$$

find $f$ such that

$$f(x) \sim y.$$

Page 43

Loss function

We need a way to measure errors

Loss function: $V(f(x), y)$

Page 44

Loss function examples

• 0-1 loss: $V(f(x), y) = \theta(-y f(x))$ ($\theta$ is the step function)
• square loss (L2): $V(f(x), y) = (f(x) - y)^2 = (1 - y f(x))^2$
• absolute value (L1): $V(f(x), y) = |f(x) - y|$
• Vapnik's $\epsilon$-insensitive loss: $V(f(x), y) = (|f(x) - y| - \epsilon)_+$
• hinge loss: $V(f(x), y) = (1 - y f(x))_+$
• logistic loss: $V(f(x), y) = \log(1 + e^{-y f(x)})$ (logistic regression)
• exponential loss: $V(f(x), y) = e^{-y f(x)}$
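The losses above written out as code (a sketch; the margin form $y f(x)$ assumes $y \in \{-1, +1\}$, and the test values are invented):

```python
import numpy as np

step = lambda t: (t > 0).astype(float)                 # the step function theta

zero_one    = lambda f, y: step(-y * f)                # 0-1 loss
square      = lambda f, y: (f - y) ** 2                # = (1 - y f)^2 for y = +-1
absolute    = lambda f, y: np.abs(f - y)               # L1 loss
eps_insens  = lambda f, y, eps=0.1: np.maximum(np.abs(f - y) - eps, 0)
hinge       = lambda f, y: np.maximum(1 - y * f, 0)
logistic    = lambda f, y: np.log(1 + np.exp(-y * f))
exponential = lambda f, y: np.exp(-y * f)

f, y = np.array([0.8, -0.3]), np.array([1.0, 1.0])     # predictions vs. labels
for name, V in [("0-1", zero_one), ("square", square), ("hinge", hinge),
                ("logistic", logistic), ("exponential", exponential)]:
    print(name, V(f, y))
```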

Page 45

Empirical error

Given a loss function $V(f(x), y)$, we can define the Empirical Error

$$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i).$$
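A minimal sketch of the definition (the candidate function and the three training pairs are invented): the empirical error is just the average loss over the sample.

```python
import numpy as np

def empirical_error(V, f, X, Y):
    """I_S[f] = (1/n) * sum_i V(f(x_i), y_i)."""
    return np.mean([V(f(x), y) for x, y in zip(X, Y)])

w = np.array([1.0, -1.0])
f = lambda x: w @ x                                  # a candidate function
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([1.0, -1.0, 1.0])
hinge = lambda fx, y: max(1 - y * fx, 0.0)

print("I_S[f] =", empirical_error(hinge, f, X, Y))   # (0 + 0 + 1)/3
```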

Page 46

Hypotheses Space

``Learning processes do not take place in vacuum.''
Cucker and Smale, AMS 2001

We need to fix a Hypotheses Space

$$\mathcal{H} \subset \mathcal{F} = \{f \mid f : X \to Y\}$$

[Figure: the hypotheses space H drawn as a subset of the space F of all functions.]

Page 47

Hypotheses Space

$$\mathcal{H} \subset \mathcal{F} = \{f \mid f : X \to Y\}$$

• Linear model: $f(x) = \sum_{j=1}^{p} x^j w^j$

• Generalized linear models: $f(x) = \sum_{j=1}^{p} \phi(x)^j w^j$

• Reproducing kernel Hilbert spaces: $f(x) = \sum_{j \ge 1} \phi(x)^j w^j = \sum_{i \ge 1} K(x, x_i) \alpha_i$

$K(x, x')$ is a symmetric positive definite function called the reproducing kernel.

These choices range from parametric (the linear model), through semi-parametric, to non-parametric (reproducing kernel Hilbert spaces).
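A minimal RKHS sketch (an assumed example: Gaussian kernel, invented centers and coefficients): a function of the kernel-expansion form $f(x) = \sum_i K(x, x_i)\, \alpha_i$.

```python
import numpy as np

def K(x, xp, sigma=1.0):
    """Gaussian kernel: symmetric and positive definite."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

centers = np.array([[0.0], [1.0], [2.0]])   # the points x_i in the expansion
alpha = np.array([1.0, -0.5, 0.3])          # expansion coefficients

f = lambda x: sum(a * K(x, xi) for a, xi in zip(alpha, centers))
print(f(np.array([0.5])))
```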


Page 50

Minimizing the empirical error

Empirical Risk Minimization (ERM)

$$\min_{f \in \mathcal{H}} I_S[f] = \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)$$


Page 52

Minimizing the empirical error

Empirical Risk Minimization (ERM)

$$\min_{f \in \mathcal{H}} I_S[f] = \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)$$

Which one is a good solution?
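A sketch of why the question matters (our own example; data, degrees and noise level invented): two hypothesis spaces, degree-1 and degree-9 polynomials, both fit by minimizing the empirical square loss. The richer space drives the empirical error toward zero yet oscillates between the samples, which is the overfitting the next slides address.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)  # noisy samples

for deg in (1, 9):
    coeffs = np.polyfit(x, y, deg)        # ERM over degree-deg polynomials
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {deg}: empirical square-loss error = {train_err:.4f}")
```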

Page 53

Statistical Learning: Overfitting and Generalization

From Andrew Ng's CS229 lecture notes (Fall 2012):

To establish notation for future use, we'll use x^(i) to denote the "input" variables (living area in this example), also called input features, and y^(i) to denote the "output" or target variable that we are trying to predict (price). A pair (x^(i), y^(i)) is called a training example, and the dataset that we'll be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}, is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y the space of output values. In this example, X = Y = R.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X -> Y so that h(x) is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore like this:

[Diagram: training set -> learning algorithm -> hypothesis h; a new x (living area of a house) is mapped by h to a predicted y (predicted price).]

When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

The training set

$$S = (x_1, y_1), \ldots, (x_n, y_n)$$

is sampled identically and independently (i.i.d.) from a fixed unknown probability distribution $p(x, y) = p(x)\, p(y|x)$.
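A minimal sketch of this sampling assumption (the distributions are invented): draw each pair i.i.d. from $p(x, y) = p(x)\, p(y|x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 3, 0.1
w = np.array([1.0, -2.0, 0.5])               # defines p(y|x) (invented)

X = rng.normal(size=(n, p))                  # x_i drawn i.i.d. from p(x)
Y = X @ w + sigma * rng.normal(size=n)       # y_i drawn from p(y | x_i)

print(X.shape, Y.shape)
```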

Page 54

Generalization and Stability: ERM and Ill-Posedness

Learning is an ill-posed problem (Jacques Hadamard).

Ill-posed problems often arise if one tries to infer general laws from few data:
• the hypothesis space is too large
• there are not enough data

In general ERM leads to ill-posed solutions because
• the solution may be too complex
• it may not be unique
• it may change radically when leaving one sample out

Regularization Theory provides results and techniques to restore well-posedness, that is stability (hence generalization).

Page 55

Theory of Machine Learning

• Beyond drawings & intuitions (...) there is a deep, rigorous mathematical foundation of regularized learning algorithms (Cucker and Smale; Vapnik and Chervonenkis; ...).

• The theory of learning is a synthesis of different fields, e.g. Computer Science (Algorithms, Complexity) and Mathematics (Optimization, Probability, Statistics).

• Central to the theory of machine learning is the problem of understanding the conditions under which ERM can solve

$$\inf_{f} \mathcal{E}(f), \qquad \mathcal{E}(f) = \mathbb{E}_{(x,y)}\big[V(y, f(x))\big]$$

Page 56

(Tikhonov) Regularization

$$\min_{f \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda R(f) \right\} \to f_S^\lambda$$

with regularizer $R$ and regularization parameter $\lambda$.

• The regularizer describes the complexity of the solution.

[Figure: two candidate functions $f_1$ and $f_2$, with $R(f_2)$ bigger than $R(f_1)$.]

• The regularization parameter determines the trade-off between complexity and empirical risk.
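A minimal Tikhonov sketch for the linear case (our own example; data and the grid of lambda values invented): square loss with regularizer $R(f) = \|w\|^2$ gives ridge regression, whose minimizer solves $(X^T X + \lambda n I)\, w = X^T Y$; varying $\lambda$ exposes the complexity/fit trade-off.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

for lam in (1e-6, 1e-2, 1.0):
    # Closed-form Tikhonov minimizer for V = square loss, R(f) = ||w||^2.
    w = np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ Y)
    fit = np.mean((X @ w - Y) ** 2)
    print(f"lambda={lam:g}: empirical error={fit:.4f}, ||w||^2={w @ w:.2f}")
```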

Page 57

Some Remarks and Some Questions

• Supervised learning in statistical learning theory: basic concepts and notation.

• The regularization approach:
  - Which hypotheses space? Which regularizer?
  - How can we find a solution in an efficient way?
  - How do we solve the fitting/regularizing trade-off?