Course 16:198:520: Introduction To Arti cial Intelligence ...ab1544/fall2016/440/lecture11.pdf · ... Introduction To Arti cial Intelligence Lecture 10 ... from David J.C. MacKay

Course 16:198:520: Introduction To Artificial IntelligenceLecture 10

Introduction To Machine Learning

Abdeslam Boularias

Wednesday, November 16, 2016

1 / 29

Outline

1 What is Machine Learning?

2 Project

2 / 29

What is Intelligence?

Intelligence is a goal-directed adaptive behavior.

An intelligent behavior is:

Goal-directed: search and inference.

Adaptive: learning from observations.

Search and Inference have been covered in the previous lectures

Search: based on the current knowledge of a problem, what is thebest sequence of actions to solve it?

Inference: based on the current knowledge of a problem, what is theprobability of some event?

3 / 29









4 / 29









Where does the knowledge about a problem come from?

5 / 29

Adaptive Behavior

The robot can see only nearby obstacles.

The robot iterates between:1 Searching for a path based the current knowledge.2 Following the path while simultaneously learning about new obstacles

and updating the current knowledge.

This is a goal-directed adaptive behavior.

Robot in a maze ( c©Robo Bionic)

6 / 29

Adaptive Behavior

In order to act successfully in a complex environment, biological systemshave developed adaptive behaviors through learning and evolution.

Sunflowers tracking the sun.Copyright Wikimedia Commons

The Ebola virus entering a cell.Copyright Nature, 2011 7 / 29

What is Learning?

Kupfermann (1985)

Learning is the acquisition of knowledge about the world.

Problem

Is learning simply a process of memorizing knowledge?

Shepherd (1988)

Learning is an adaptive change in behavior caused by experience.

8 / 29

What is Learning?

Kupfermann (1985)


Problem


Shepherd (1988)


9 / 29

What is Learning?

Kupfermann (1985)


Problem


Shepherd (1988)


10 / 29

What is Machine Learning?

Ron Kohavi; Foster Provost (1998). ”Glossary of terms“

Machine Learning is a subfield of computer science that explores the studyand construction of algorithms that can learn from and make predictionson data.

Machine Learning searches for patterns (regularities) in data that allowsthe prediction of new data.This is also known as empirical inference.

Empirical Inference

Data, observations ⇒ rules, models

11 / 29

What is Machine Learning?

Empirical Inference


How is machine learning different statistics?

Both machine learning and statistics are concerned with summarizingdata (or extracting rules from data).

Statistics focus on data analysis (e.g, hypothesis testing).

Machine learning is more concerned with finding efficient algorithms:algorithms that run fast and require as little data as possible to makepredictions that are as accurate as possible.

12 / 29

Empirical Inference

Empirical Inference


Extracting rules from observations has always been the quest of science.

Example

Observations of the movements of planets(from The Boy Scientist)

⇒Law of gravity

13 / 29

Empirical Inference

Empirical Inference



Example

Observations of data points (x, y)

⇒ Y = aX + b

Law describing the relationbetween x and y.

14 / 29

Empirical Inference

Empirical Inference



Is machine learning the automatization of science?

Physics searches for laws explaining simple observations about theuniverse.

Machine learning searches for laws explaining complex observations,such as protein structures, gene expressions, speech, text, and images.

For example, machine learning is increasingly becoming an essentialtool in biology.

15 / 29

Empirical Inference

Empirical Inference


Example: Extract a law that maps an image to a digit

Observations of data points (image, digit)

⇒

f( )

= 1

f( )

= 2

f( )

= 3

...

Law f describing the relationbetween images and digits.

16 / 29

Generalization

Empirical Inference


Generalization

The rule (or model) should be used to predict new observations.

17 / 29

Generalization

Example

Observe: 1, 2, 4, 7, . . .

What is next?

1, 2, 4, 7, 11, 16, . . . : an+1 = an + n

1, 2, 4, 7, 12, 20, . . . : an+2 = an+1 + an + 1

1, 2, 4, 7, 14, 28, . . . : divisors of 28.

1, 2, 4, 7, 1, 1, 5 . . . : decimal expansions of π = 3.14159 . . . ande = 2.718 . . . interleaved.

Which of these answers is the right one?

We don’t know.

18 / 29

Generalization

Example

Observe: 1, 2, 4, 7, . . .

What is next?

1, 2, 4, 7, 11, 16, . . . : an+1 = an + n

1, 2, 4, 7, 12, 20, . . . : an+2 = an+1 + an + 1

1, 2, 4, 7, 14, 28, . . . : divisors of 28.



We don’t know.

19 / 29

Generalization

Example

Observe: 1, 2, 4, 7, . . .

What is next?

1, 2, 4, 7, 11, 16, . . . : an+1 = an + n

1, 2, 4, 7, 12, 20, . . . : an+2 = an+1 + an + 1

1, 2, 4, 7, 14, 28, . . . : divisors of 28.



We don’t know.

20 / 29

Generalization

Example

Observe: 1, 2, 4, 7, . . .

What is next?

1, 2, 4, 7, 11, 16, . . . : an+1 = an + n

1, 2, 4, 7, 12, 20, . . . : an+2 = an+1 + an + 1

1, 2, 4, 7, 14, 28, . . . : divisors of 28.



We don’t know.

21 / 29

Generalization

Example

Observe: 1, 2, 4, 7, . . .

What is next?

1, 2, 4, 7, 11, 16, . . . : an+1 = an + n

1, 2, 4, 7, 12, 20, . . . : an+2 = an+1 + an + 1

1, 2, 4, 7, 14, 28, . . . : divisors of 28.



We don’t know.

22 / 29

Generalization

Example

Observe: 1, 2, 4, 7, . . .

What is next?

1, 2, 4, 7, 11, 16, . . . : an+1 = an + n

1, 2, 4, 7, 12, 20, . . . : an+2 = an+1 + an + 1

1, 2, 4, 7, 14, 28, . . . : divisors of 28.



We don’t know.

23 / 29

Generalization

Principle of Occam’s razor

Among competing hypotheses, the one with the fewest assumptions shouldbe selected.In other terms, the simplest model (or rule) is the one that will most likelymake the smallest generalization errors.

Example

Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981

You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

28

Model Comparison and Occam’s Razor

��

��

Figure 28.1. A picture to beinterpreted. It contains a tree andsome boxes.

�28.1 Occam’s razor

How many boxes are in the picture (figure 28.1)? In particular, how manyboxes are in the vicinity of the tree? If we looked with x-ray spectacles,would we see one or two boxes behind the trunk (figure 28.2)? (Or evenmore?) Occam’s razor is the principle that states a preference for simple

1?

or 2?

Figure 28.2. How many boxes arebehind the tree?

theories. ‘Accept the simplest explanation that fits the data’. Thus accordingto Occam’s razor, we should deduce that there is only one box behind the tree.Is this an ad hoc rule of thumb? Or is there a convincing reason for believingthere is most likely one box? Perhaps your intuition likes the argument ‘well,it would be a remarkable coincidence for the two boxes to be just the sameheight and colour as each other’. If we wish to make artificial intelligencesthat interpret data correctly, we must translate this intuitive feeling into aconcrete theory.

Motivations for Occam’s razor

If several explanations are compatible with a set of observations, Occam’srazor advises us to buy the simplest. This principle is often advocated for oneof two reasons: the first is aesthetic (‘A theory with mathematical beauty ismore likely to be correct than an ugly one that fits some experimental data’

343

Observation

Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981

You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

28

Model Comparison and Occam’s Razor

��

��

Figure 28.1. A picture to beinterpreted. It contains a tree andsome boxes.

�28.1 Occam’s razor

How many boxes are in the picture (figure 28.1)? In particular, how manyboxes are in the vicinity of the tree? If we looked with x-ray spectacles,would we see one or two boxes behind the trunk (figure 28.2)? (Or evenmore?) Occam’s razor is the principle that states a preference for simple

1?

or 2?

Figure 28.2. How many boxes arebehind the tree?

theories. ‘Accept the simplest explanation that fits the data’. Thus accordingto Occam’s razor, we should deduce that there is only one box behind the tree.Is this an ad hoc rule of thumb? Or is there a convincing reason for believingthere is most likely one box? Perhaps your intuition likes the argument ‘well,it would be a remarkable coincidence for the two boxes to be just the sameheight and colour as each other’. If we wish to make artificial intelligencesthat interpret data correctly, we must translate this intuitive feeling into aconcrete theory.

Motivations for Occam’s razor

If several explanations are compatible with a set of observations, Occam’srazor advises us to buy the simplest. This principle is often advocated for oneof two reasons: the first is aesthetic (‘A theory with mathematical beauty ismore likely to be correct than an ugly one that fits some experimental data’

343

Which hypothesis is true, 1 or 2?

from David J.C. MacKay. Information Theory, Inference, and Learning Algorithms. 24 / 29

Project: Image Classification

Which Digit? Which are Faces?

25 / 29


Data:http://www.cs.rutgers.edu/~ab1544/dataAndCode/data.zip

Which Digit?Face or not face? 26 / 29

http://www.cs.rutgers.edu/~ab1544/dataAndCode/data.zip


What you should do

1 Implement three classification algorithms for detecting faces andclassifying digits:

PerceptronNaive Bayes ClassifierAn algorithm of your choice

2 Design the features for each of the two problems.

3 Compare the three algorithms, and report the prediction error (andstandard deviation) as a function of the number of data points usedfor trainning.

4 Write a small report (minimum two pages) describing theimplemented algorithms and discussing the results.

5 Submit the code and the report by December 15, 2016.

6 Setup an appointement with the TA for demonstrating yoursubmitted program.

27 / 29


Do NOT use an existing library for the learning algorithms!

It’s OK to share ideas, but not code or writing.

Part of your score will depend on the accuracy of the predictionsmade by your program.

The data set is separated into three sets:

Training and validation: used to learn and find the parameters of yourmodel.Testing: used to evaluate the learned model.

Your algorithm should not look at the testing data before the trainingis over. If you use any testing data point for training, that would beconsidered as cheating.

28 / 29


Acknowledgement: This topic is based on the one created by Dan Kleinand John DeNero that was given as part of the programming assignmentsof Berkeley’s CS188 course.https://www.cs.utexas.edu/~pstone/Courses/343spring10/assignments/classification/classification.html

http://inst.eecs.berkeley.edu/~cs188/sp11/projects/classification/classification.html

29 / 29

https://www.cs.utexas.edu/~pstone/Courses/343spring10/assignments/classification/classification.html

http://inst.eecs.berkeley.edu/~cs188/sp11/projects/classification/classification.html

Documents

Course 16:198:520: Introduction To Arti cial Intelligence ...ab1544/fall2016/440/lecture11.pdf · ... Introduction To Arti cial Intelligence Lecture 10 ... from David J.C. MacKay