
Introduction to Information-Theoretic Modeling

Teemu Roos

Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki

BCSMIF, Maresias, April 11–12, 2011

Jorma Rissanen (left) receiving the IEEE Information Theory Society Best Paper Award from Claude Shannon in 1986.

IEEE Golden Jubilee Award for Technological Innovation (for the invention of arithmetic coding), 1998; IEEE Richard W. Hamming Medal (for fundamental contribution to information theory, statistical inference, control theory, and the theory of complexity), 1993; Kolmogorov Medal, 2006; IBM Outstanding Innovation Award (for work in statistical inference, information theory, and the theory of complexity), 1988; IEEE Claude E. Shannon Award, 2009; ...

Outline

1 Occam's Razor
    House
    Razor

2 MDL Principle
    Rules & Exceptions
    Probabilistic Models

3 Universal Source Coding
    Two-Part Codes
    Mixture Codes
    Normalized Maximum Likelihood
    Universal Prediction

4 MDL Principle (contd.)


House

Brandon has

1 cough,
2 severe abdominal pain,
3 nausea,
4 low blood pressure,
5 fever.

Taken one at a time, each symptom points to its own diagnosis:

1 cough: pneumonia,
2 severe abdominal pain: appendicitis,
3 nausea: food poisoning,
4 low blood pressure: hemorrhage,
5 fever: meningitis.

No single disease causes all of these. Each symptom can be caused by some (possibly different) disease...

Dr. House explains the symptoms with two simple causes:

1 common cold, causing the cough and fever,
2 pharmacy error: cough medicine replaced by gout medicine, causing the abdominal pain, nausea, and low blood pressure.

Symptom by symptom, his explanation reads:

1 cough: common cold,
2 severe abdominal pain: gout medicine,
3 nausea: gout medicine,
4 low blood pressure: gout medicine,
5 fever: common cold.


[Portrait: William of Ockham (c. 1288–1348)]

Occam's Razor

Entities should not be multiplied beyond necessity.

Isaac Newton: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances."

Diagnostic parsimony: Find the fewest possible causes that explain the symptoms.

(Hickam's dictum: "Patients can have as many diseases as they damn well please.")


Guessing Game

[A sequence of figures illustrating the guessing game.]


MDL Principle

Minimum Description Length (MDL) Principle (two-part)
Choose the hypothesis which minimizes the sum of

1 the codelength of the hypothesis, and
2 the codelength of the data with the help of the hypothesis:

    min_{h ∈ H} ( ℓ(h) + ℓ(D ; h) )

How to encode data with the help of a hypothesis?

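Since codelengths are just numbers of bits, the two-part criterion is directly computable once every hypothesis comes with a codelength ℓ(h) and a distribution p_h. A minimal sketch in Python (the function and the way hypotheses are packaged are illustrative assumptions, not from the slides):

```python
import math

def two_part_mdl(hypotheses, data):
    """Return the hypothesis minimizing l(h) + l(D; h).

    hypotheses: dict mapping a name to (l_h, p), where l_h is the
    codelength of the hypothesis in bits and p(data) is the
    probability the hypothesis assigns to the whole data set.
    """
    best, best_len = None, math.inf
    for name, (l_h, p) in hypotheses.items():
        total = l_h + math.log2(1.0 / p(data))  # l(h) + l(D; h)
        if total < best_len:
            best, best_len = name, total
    return best, best_len

# Two Bernoulli hypotheses, each encodable in one bit (l(h) = 1).
bern = lambda q: (lambda xs: math.prod(q if x else 1 - q for x in xs))
data = [1, 1, 0, 1, 1, 1, 0, 1]
print(two_part_mdl({"p=0.5": (1, bern(0.5)),
                    "p=0.8": (1, bern(0.8))}, data))
```

Here p = 0.8 wins: at equal hypothesis cost, it assigns the data higher probability and so a shorter data part.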

Encoding Data: Rules & Exceptions

Idea 1: Hypothesis = rule; encode exceptions.

Example: a black box of size 25 × 25 = 625 pixels, with white dots (the exceptions) at (x1, y1), (x2, y2), (x3, y3).

For an image of size n = 625, there are 2^n different images, and

    C(n, k) = n! / (k! (n − k)!)

different groups of k exceptions.


Encoding k and the positions of the k exceptions costs log2(n + 1) + log2 C(n, k) bits, to be compared with the log2 2^625 = 625 bits of the raw image:

k = 1:    C(n, 1) = 625 ≪ 2^625 ≈ 1.4 × 10^188;   codelength ≈ 19 vs. 625
k = 2:    C(n, 2) = 195 000 ≪ 2^625;              codelength ≈ 27 vs. 625
k = 3:    C(n, 3) = 40 495 000 ≪ 2^625;           codelength ≈ 35 vs. 625
k = 10:   C(n, 10) = 2 331 354 000 000 000 000 000 ≪ 2^625;  codelength ≈ 80 vs. 625
k = 100:  C(n, 100) ≈ 9.5 × 10^117 ≪ 2^625;       codelength ≈ 401 vs. 625
k = 300:  C(n, 300) ≈ 2.7 × 10^186 < 2^625;       codelength ≈ 629 vs. 625
k = 372:  C(n, 372) ≈ 5.1 × 10^181 ≪ 2^625;       codelength ≈ 613 vs. 625

With only a few exceptions the rule-plus-exceptions code is far shorter than the raw data; as k approaches n/2 the codelength exceeds 625 bits and the hypothesis no longer helps.
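These numbers are easy to verify with exact integer binomials; a quick sketch:

```python
import math

n = 625  # the 25 x 25 bitmap
for k in (1, 2, 3, 10, 100, 300, 372):
    # log2(n + 1) bits to state k, log2 C(n, k) bits for the positions
    bits = math.log2(n + 1) + math.log2(math.comb(n, k))
    print(f"k = {k:3d}: codelength ~ {bits:5.0f} bits vs. {n} bits raw")
```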

Encoding Data: Probabilistic Models

Idea 2: Hypothesis = probability distribution.

Important Observation
Probability distributions are codes are probability distributions!

The codelength of the data is given by

    ℓ(D ; h) = log2 (1 / p_h(D)).

Remember to encode the distribution too: ℓ(h) > 0.

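Concretely, under an i.i.d. hypothesis the codelength decomposes over symbols; a small sketch (the helper name is ours):

```python
import math

def codelength_bits(p_h, data):
    """l(D; h) = log2(1 / p_h(D)); for i.i.d. symbols the log turns
    the product of probabilities into a sum of per-symbol bits."""
    return sum(math.log2(1.0 / p_h[x]) for x in data)

p_h = {1: 0.9, 0: 0.1}               # hypothesis: Pr(1) = 0.9
data = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
print(codelength_bits(p_h, data))    # ~7.9 bits for 10 symbols
```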

MDL & Bayes

The MDL model selection criterion

    minimize ℓ(h) + ℓ(D ; h)

can be interpreted (via p = 2^(−ℓ)) as

    maximize p(h) × p_h(D).

In Bayesian probability, this is equivalent to maximization of the posterior probability

    p(h | D) = p(h) p(D | h) / p(D),

where the term p(D) (the marginal probability of D) is constant wrt. h and doesn't affect model selection.

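The correspondence can be checked numerically: because ℓ = −log2 p is strictly decreasing, the hypothesis with the smallest total codelength is exactly the one with the largest p(h) p(D | h). A toy sketch (the prior and likelihood numbers are made up purely for illustration):

```python
import math

prior = {"h1": 0.5, "h2": 0.25, "h3": 0.25}           # p(h), assumed
likelihood = {"h1": 0.001, "h2": 0.03, "h3": 0.0004}  # p(D | h), assumed

for h in prior:
    total_bits = math.log2(1 / prior[h]) + math.log2(1 / likelihood[h])
    print(h, round(total_bits, 2), prior[h] * likelihood[h])
# h2 has both the smallest l(h) + l(D; h) and the largest
# p(h) p(D | h): the two criteria always agree.
```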

MDL & Bayes

"Do not include in the model a new parameter unless there is strong evidence it is not null."


Encoding Data: Probabilistic Models

Idea 3: Hypothesis = set of probability distributions = model class.

Universal Coding
A universal code achieves almost as short a codelength as the code based on the best distribution in the model class.

Different types of universal codes:

1 two-part code,
2 mixture code,
3 normalized maximum likelihood (NML) code.



Universal Models

The typical situation might be as follows:

1 We know (think) that the source symbols are generated by a Bernoulli model with parameter p ∈ [0, 1].
2 However, we do not know p in advance.
3 We'd like to encode the data at rate H(p).

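For reference, the target rate H(p) is the binary entropy; a one-function sketch:

```python
import math

def binary_entropy(p):
    """H(p) in bits per symbol, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0: fair coin flips are incompressible
print(binary_entropy(0.1))  # ~0.47: skewed sources compress well
```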

Two-Part Codes

Let M = {p_θ : θ ∈ Θ} be a parametric probabilistic model class, i.e., a set of distributions p_θ indexed by a parameter θ.

For any distribution p_θ, the Shannon codelengths satisfy

    ℓ_θ(D) = log2 (1 / p_θ(D)).

Using parameter value θ, the total codelength becomes

    ℓ1(θ) + log2 (1 / p_θ(D)),

where ℓ1(θ) is the codelength of the parameter itself.


Continuous Parameters

What if the parameters are continuous (like polynomial coefficients)?

Solution: Quantization. Choose a discrete subset of points θ(1), θ(2), . . . , and use only them. (Information geometry!)

If the points are sufficiently dense (in a codelength sense), then the codelength for the data is still almost as short as min_{θ ∈ Θ} ℓ_θ(D).

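A sketch of the resulting two-part code for Bernoulli data, with the parameter quantized to a uniform grid (the grid size and the uniform code over grid points are our assumptions; a denser grid trades a longer parameter part for a better fit):

```python
import math

def two_part_bits(data, grid_size=17):
    """Best total codelength l1(theta) + log2(1 / p_theta(D)) over a
    uniform grid of Bernoulli parameters; l1(theta) = log2(grid_size)."""
    n, k = len(data), sum(data)
    best = math.inf
    for j in range(grid_size):
        p = (j + 0.5) / grid_size  # grid point; avoids p = 0 and p = 1
        data_bits = -(k * math.log2(p) + (n - k) * math.log2(1 - p))
        best = min(best, math.log2(grid_size) + data_bits)
    return best

data = [1] * 70 + [0] * 30
print(two_part_bits(data))  # ~92.4 bits, near 100 * H(0.7) + 4.1
```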

Mixture Universal Model

There are universal codes that are strictly better than the two-part code.

For instance, given a code for the parameters, let w be a distribution over the parameter space Θ (quantized if necessary) defined as

    w(θ) = 2^(−ℓ(θ)).

Let p_w be a mixture distribution over the data sets D ∈ 𝒟, defined as

    p_w(D) = Σ_{θ ∈ Θ} p_θ(D) w(θ),

i.e., an "average" distribution, where each p_θ is weighted by w(θ).

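A matching sketch of the mixture codelength over the same kind of grid, with a uniform weight w (an assumption; any w summing to one works):

```python
import math

def mixture_bits(data, grid_size=17):
    """l(D) = log2(1 / p_w(D)) with p_w(D) = sum_theta p_theta(D) w(theta)."""
    n, k = len(data), sum(data)
    w = 1.0 / grid_size  # uniform weights over the grid
    p_w = sum((p ** k) * ((1 - p) ** (n - k)) * w
              for p in ((j + 0.5) / grid_size for j in range(grid_size)))
    return math.log2(1.0 / p_w)

data = [1] * 70 + [0] * 30
print(mixture_bits(data))  # a bit under the ~92.4 bits of the two-part code
```

The mixture never pays the full parameter cost of a single grid point; every grid point contributes to p_w(D), so with the same w the mixture codelength is never longer than the two-part one.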

Normalized Maximum Likelihood

Consider again the maximum likelihood model

    p_θ̂(D) = max_{θ ∈ Θ} p_θ(D).

It is the best probability assignment achievable under the model class M.

Unfortunately, it is not possible to use the ML model for coding, because it is not a probability distribution:

    C = Σ_{D ∈ 𝒟} p_θ̂(D) > 1,

unless θ̂ is constant wrt. D.


Normalized Maximum Likelihood

The normalized maximum likelihood (NML) model is obtained by normalizing the ML model:

    p_nml(D) = p_θ̂(D) / C,  where C = Σ_{D ∈ 𝒟} p_θ̂(D).

NML codelength:

    ℓ(D) = log2 (1 / p_nml(D)) = log2 (1 / p_θ̂(D)) + log2 C.

The more flexible (complex) the model class, the greater the normalizing constant.

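For the Bernoulli class on binary sequences of length n (our own illustration), C can be computed exactly by grouping the 2^n sequences by their count of ones, since the ML parameter for a sequence with k ones is k/n. A sketch:

```python
import math

def bernoulli_nml_normalizer(n):
    """C = sum over all D of max_theta p_theta(D) for the Bernoulli
    model on length-n binary sequences; 0.0 ** 0 == 1.0 handles k = 0, n."""
    return sum(math.comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
               for k in range(n + 1))

for n in (1, 2, 5, 10):
    print(n, round(bernoulli_nml_normalizer(n), 3))
# 2.0, 2.5, 3.51, 4.66: C, and hence the complexity penalty
# log2 C, grows as the sequences get longer.
```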

[Four figures illustrating the normalized maximum likelihood model, with normalizing constants C = 1.00, C = 1.92, C = 2.51, and C = 4.66.]

Universal Models

We have seen three kinds of universal codes:

1 two-part,
2 mixture,
3 NML.

There are also universal codes that are not based on any (explicit) model class: Lempel-Ziv (WinZip, gzip)!


Uses of Universal Codes

So what do we do with them? We can use universal codes for (at least) three purposes:

1 compression,
2 prediction,
3 model selection.


Page 128: Introduction to Information-Theoretic Modeling

Occam’s RazorMDL Principle

Universal Source CodingMDL Principle (contd.)

Two-Part CodesMixture CodesNormalized Maximum LikelihoodUniversal Prediction

Universal Prediction

By the connection $p(D) = 2^{-\ell(D)}$, the following are equivalent:

good compression: $\ell(D)$ is small,

good predictions: $p(D_i \mid D_1, \ldots, D_{i-1})$ is high for most $i \in \{1, \ldots, n\}$.

For instance, the mixture code gives a natural predictor which is equivalent to Bayesian prediction.

The NML model gives predictions that are good relative to the best model in the model class, no matter what happens.
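To make the compression–prediction equivalence concrete, here is a minimal Python sketch (mine, not from the slides) of the Bernoulli mixture code used as a predictor: with a uniform prior the Bayes predictor is Laplace's rule of succession, and the cumulative log-loss of the sequential predictions is exactly the mixture code length of the whole sequence.

```python
import math

def laplace_prob(ones, total):
    # Bayes predictor for the Bernoulli class under a uniform prior:
    # p(next = 1 | data so far) = (#ones + 1) / (n + 2).
    return (ones + 1) / (total + 2)

def mixture_code_length(bits):
    """Cumulative log-loss of the sequential predictor = mixture code length."""
    ones, total_bits = 0, 0.0
    for i, b in enumerate(bits):
        p1 = laplace_prob(ones, i)
        p = p1 if b == 1 else 1 - p1
        total_bits += -math.log2(p)   # bits spent on symbol i
        ones += b
    return total_bits

print(f"{mixture_code_length([0, 1, 1, 0, 1, 1, 1, 1]):.2f} bits")
```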


Model (Class) Selection

Since a model class that enables good compression of the data must be based on exploiting the regular features in the data, the code-length can be used as a yardstick for comparing model classes.


Time for a break?


1 Occam’s Razor

2 MDL Principle

3 Universal Source Coding

4 MDL Principle (contd.)
   Modern MDL
   Histogram Density Estimation
   Clustering
   Linear Regression
   Wavelet Denoising


MDL Principle

“Old-style”:

Choose the model $p_\theta \in \mathcal{M}$ that yields the shortest two-part code-length

$\min_{\theta, \mathcal{M}} \left[ \ell(\mathcal{M}) + \ell_1(\theta) + \log_2 \frac{1}{p_\theta(D)} \right].$

Modern:

Choose the model class $\mathcal{M}$ that yields the shortest universal code-length

$\min_{\mathcal{M}} \left[ \ell(\mathcal{M}) + \ell_{\mathcal{M}}(D) \right].$
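As a toy illustration of the “old-style” criterion, the following sketch (my own example, not the slides' construction) computes a two-part code length for the Bernoulli class alone: the parameter is quantized to a fixed number of bits, and the $\ell(\mathcal{M})$ term is omitted since only one class is considered.

```python
import math

def two_part_bernoulli(bits, precision_bits=8):
    """l_1(theta) + log2(1 / p_theta(D)) with theta on a 2^precision_bits grid."""
    n, ones = len(bits), sum(bits)
    grid = 2 ** precision_bits
    # quantize the ML estimate, avoiding the degenerate endpoints 0 and 1
    theta = min(max(round(ones / n * grid), 1), grid - 1) / grid
    data_bits = -(ones * math.log2(theta) + (n - ones) * math.log2(1 - theta))
    return precision_bits + data_bits

data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1] * 10
print(f"two-part code length: {two_part_bernoulli(data):.1f} bits")
```

An optimal two-part code would also tune the precision; the $\approx \frac{k}{2}\log_2 n$ rule for that appears later in this deck.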


MDL Model Selection

Intuitive Explanation of MDL

The success in extracting the structure from data can be measured by the code-length.

We can only extract the structure that is “visible” to the model class(es) used. For instance, the Bernoulli (coin flipping) model only sees the number of 1s.

When the model can express the structural properties pertaining to the data but not more, the total code-length is minimal.

Important: Too complex models lead to a long total code-length (Occam’s Razor!).


Multinomial Models

The multinomial model — the generalization of Bernoulli — is very simple:

$\Pr(X = j) = \theta_j$, for $j \in \{1, \ldots, m\}$.

Maximum likelihood:

$\hat{\theta}_j = \frac{\#\{i : x_i = j\}}{n}.$

Two-part, mixture, and NML models are readily defined. ⇒ Exercises 7–9.
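A small sketch (not from the slides) of the maximum-likelihood estimate and the resulting data code length $-\log_2 p_{\hat\theta}(x^n)$:

```python
from collections import Counter
import math

def multinomial_ml_bits(xs):
    """-log2 p_thetahat(x^n) under the multinomial ML estimate
    theta_j = #{i : x_i = j} / n."""
    n, counts = len(xs), Counter(xs)
    return -sum(c * math.log2(c / n) for c in counts.values())

xs = [1, 2, 2, 3, 2, 1, 2, 2]
print(f"-log2 p_ML(data) = {multinomial_ml_bits(xs):.2f} bits")
```

This is the data term of the two-part code; the NML version divides by the constant $C^m_n$, computed on the next slide.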


Fast NML for Multinomials

The naïve way to compute the normalizing constant in the NML model

$\frac{p_{\hat\theta(x^n)}(x^n)}{C^m_n}, \qquad C^m_n = \sum_{y^n \in \mathcal{X}^n} p_{\hat\theta(y^n)}(y^n),$

takes exponential time ($\Omega(m^n)$).

However, there is a trick (Kontkanen & Myllymäki, 2007) to do it in linear time.

Kontkanen & Myllymäki, “A linear-time algorithm for computing the multinomial stochastic complexity”, Information Processing Letters 103 (2007), no. 6, pp. 227–233.
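A sketch of the linear-time computation, assuming the recurrence of the cited paper: $C^1_n = 1$, $C^2_n$ is an $O(n)$ binomial sum, and $C^m_n = C^{m-1}_n + \frac{n}{m-2}\, C^{m-2}_n$ for $m > 2$.

```python
from math import comb

def multinomial_nml_constant(m, n):
    """C^m_n in O(n + m) time via the Kontkanen-Myllymaki recurrence."""
    c1 = 1.0
    # base case m = 2: sum over the count h of one of the two symbols
    c2 = sum(comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
             for h in range(n + 1))   # Python evaluates 0.0 ** 0 as 1
    if m == 1:
        return c1
    prev2, prev1 = c1, c2
    for k in range(3, m + 1):
        prev2, prev1 = prev1, prev1 + (n / (k - 2)) * prev2
    return prev1

print(multinomial_nml_constant(2, 8))   # Bernoulli normalizer for n = 8
```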


Histogram Density Estimation

A histogram density is defined by

The break-points between the bins,

The heights of the bins.

Choosing the number and the positions of break-points can be done by MDL.

The code-length is equivalent (up to additive constants) to the code-length in a multinomial model. ⇒ The linear-time algorithm can be used.
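One simple instantiation (a rough sketch under simplifying assumptions, not the exact construction of the slides): pick the number of equal-width bins by a BIC-style two-part code length, with $(k-1)/2 \cdot \log_2 n$ bits standing in for the exact multinomial NML term.

```python
import math, random

def histogram_mdl_bits(data, n_bins):
    """-log2 likelihood under the histogram density
    plus (n_bins - 1)/2 * log2(n) bits for the bin heights."""
    n, lo, hi = len(data), min(data), max(data)
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for x in data:
        counts[min(int((x - lo) / width), n_bins - 1)] += 1
    loglik = sum(c * math.log2(c / (n * width)) for c in counts if c > 0)
    return -loglik + (n_bins - 1) / 2 * math.log2(n)

random.seed(0)
data = [random.gauss(0, 1) for _ in range(500)]
best = min(range(1, 31), key=lambda k: histogram_mdl_bits(data, k))
print("MDL-chosen number of bins:", best)
```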

[Figure: histogram density estimation examples.]

Clustering

Consider the problem of clustering vectors of (independent) multinomial variables.

This can be seen as a way to encode (compress) the data:

1 first encode the cluster index of each observation vector,

2 then encode the observations using separate (multinomial) models.

Again, the problem is reduced to the multinomial case, and the fast NML algorithm can be applied.
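A sketch of the code-length bookkeeping (ML code lengths stand in for the NML versions of the slides): as described above, first the cluster indices, then each feature within each cluster.

```python
from collections import Counter
import math

def ml_bits(symbols):
    """-log2 likelihood of a sequence under its own multinomial ML estimate."""
    n, counts = len(symbols), Counter(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())

def clustering_bits(rows, assignment):
    """Cluster indices first, then per-cluster multinomial models per feature."""
    bits = ml_bits(assignment)
    for c in set(assignment):
        members = [r for r, a in zip(rows, assignment) if a == c]
        for f in range(len(rows[0])):
            bits += ml_bits([r[f] for r in members])
    return bits

rows = [(1, 1), (1, 2), (1, 1), (2, 2), (2, 2), (2, 1)]
print(f"{clustering_bits(rows, [0, 0, 0, 1, 1, 1]):.1f} bits")  # coherent split
print(f"{clustering_bits(rows, [0, 1, 0, 1, 0, 1]):.1f} bits")  # arbitrary split
```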


Clustering

The clustering model can be interpreted as the naïve Bayes structure:

[Diagram: naïve Bayes structure; label = cluster index, $f_1, \ldots, f_n$ are features.]

The structure is very restrictive. Generalization is achieved by Bayesian networks.

An MDL criterion for learning Bayesian network structures is again based on fast NML for multinomials.


Bayesian Networks

[Diagram: an example DAG with nodes Age, Occupation, Climate, Disease, and Symptoms.]

Directed acyclic graph (DAG) describing conditional independencies (causality?).

Denote the parents of $X_i$ by $\mathrm{Pa}_i \subset \{X_1, \ldots, X_m\} \setminus \{X_i\}$.

Joint probability:

$p(x_1, \ldots, x_m) = \prod_{i=1}^{m} p(x_i \mid \mathrm{pa}_i).$
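A tiny sketch of the factorization, with hypothetical conditional probability tables for a three-node chain $X_1 \to X_2 \to X_3$ (the tables and the chain are made up for illustration):

```python
# hypothetical CPTs: cpts[var][parent_values][value] = probability
cpts = {
    "X1": {(): {0: 0.7, 1: 0.3}},
    "X2": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.4, 1: 0.6}},
    "X3": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.3, 1: 0.7}},
}
parents = {"X1": (), "X2": ("X1",), "X3": ("X2",)}

def joint(assignment):
    """p(x1, ..., xm) = product of p(x_i | pa_i) over the nodes."""
    p = 1.0
    for var, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p *= cpts[var][pa_vals][assignment[var]]
    return p

print(joint({"X1": 1, "X2": 1, "X3": 0}))   # 0.3 * 0.6 * 0.3 = 0.054
```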


Bayesian Networks

The problem with NML for Bayesian networks is the summation over all possible data-sets: exponential complexity.

Approximation: normalization over smaller blocks at a time.

[Diagram: the data matrix (rows × columns $X_1, X_2, X_3$) with blocks normalized jointly or separately under NML, fNML, sNML, and fsNML.]

⇒ Factorized NML


How to Encode Continuous Data?

In order to encode data using, say, the Gaussian density we face the question: how to encode continuous data?

We already know how to encode using models with continuous parameters:

two-part with optimal quantization ($\approx \frac{k}{2} \log_2 n$),

mixture code,

NML.

It is obviously not possible to encode data with infinite precision. We have to discretize: encode $x$ only up to precision $\delta$.
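A standard observation (not spelled out on the slide) shows why the precision matters so little: for small $\delta$ the code length of $x$ under a density $f$ is

$-\log_2 \int_x^{x+\delta} f(t)\,dt \;\approx\; -\log_2\big(f(x)\,\delta\big) \;=\; -\log_2 f(x) - \log_2 \delta,$

and the $-\log_2 \delta$ term is the same for every model, so it cancels whenever code-lengths are compared.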


Gaussian Density

Recall the Gaussian density function:

$\phi_{\mu,\sigma^2}(x_1, \ldots, x_n) \overset{(i.i.d.)}{=} (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{2\sigma^2} \right).$

The code-length $-\log_2 \phi_{\mu,\sigma^2}$ is then

$\frac{n}{2} \log_2(2\pi\sigma^2) + \frac{\log_2 e}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2.$


Gaussian Density

Ok, we have our Gaussian code-length formula:

$\frac{n}{2} \log_2(2\pi\sigma^2) + \frac{\log_2 e}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2.$

Let’s use the two-part code and plug in the maximum likelihood parameters:

$\hat\mu = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat\mu)^2.$


With $\hat\sigma^2$ plugged in, the sum equals $n\hat\sigma^2$, and the formula simplifies to

$\frac{n}{2} \log_2(2\pi\hat\sigma^2) + \frac{n}{2} \log_2 e.$

Collecting everything that does not depend on the data into a constant:

$\frac{n}{2} \log_2 \hat\sigma^2 + \text{constant}.$


We get the total (two-part) code-length formula:

$\frac{n}{2} \log_2 \hat\sigma^2 + \frac{k}{2} \log_2 n + \text{constant}.$

Since we have two parameters, $\mu$ and $\sigma^2$, we let $k = 2$.
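A direct transcription into Python (a sketch; the additive constant is dropped, as it is the same for all models being compared):

```python
import math

def gaussian_two_part_bits(xs):
    """(n/2) log2(sigma_hat^2) + (k/2) log2(n) with k = 2, constant dropped."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n    # ML estimate of sigma^2
    return n / 2 * math.log2(var) + 2 / 2 * math.log2(n)

xs = [2.1, 1.9, 2.4, 2.0, 1.8, 2.2]
print(f"{gaussian_two_part_bits(xs):.2f} bits + constant")
```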


Linear Regression

A similar treatment can be given to linear regression models.

The model includes a set of regressor variables $x_1, \ldots, x_p \in \mathbb{R}$, and a set of coefficients $\beta_1, \ldots, \beta_p$.

The dependent variable, $Y$, is assumed to be Gaussian:

the mean $\mu$ is given as a linear combination of the regressors:

$\mu = \beta_1 x_1 + \cdots + \beta_p x_p = \beta' x,$

the variance is some parameter $\sigma^2$.


Linear Regression

For a sample of size $n$, the matrix notation is convenient:

$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$

Then the model can be written as

$Y = X\beta + \varepsilon, \qquad \text{where } \varepsilon_i \sim \mathcal{N}(0, \sigma^2).$


Linear Regression

The maximum likelihood estimators are now

$\hat\beta = (X'X)^{-1} X'Y, \qquad \hat\sigma^2 = \frac{1}{n} \lVert Y - X\hat\beta \rVert_2^2 = \frac{\mathrm{RSS}}{n},$

where RSS is the “residual sum of squares”.

Since the errors are assumed Gaussian, our code-length formula applies. The number of parameters is now $p + 1$ ($p$ of the $\beta$s and $\sigma^2$), so we get

$\frac{n}{2} \log_2 \mathrm{RSS} + \frac{p+1}{2} \log_2 n + \text{constant}.$
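The criterion is easy to evaluate; here is a sketch using NumPy's least squares (which coincides with the ML estimate under Gaussian errors):

```python
import numpy as np

def regression_mdl_bits(X, y):
    """(n/2) log2 RSS + (p+1)/2 log2 n, additive constant dropped."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n / 2 * np.log2(rss) + (p + 1) / 2 * np.log2(n)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.5, size=100)
print(f"{regression_mdl_bits(X, y):.1f} bits + constant")
```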


Subset Selection Problem

Often we have a large set of potential regressors, some of which may be irrelevant.

The MDL principle can be used to select a subset of them by comparing the total code-lengths:

$\min_S \left[ \frac{n}{2} \log_2 \mathrm{RSS}_S + \frac{|S| + 1}{2} \log_2 n \right],$

where $\mathrm{RSS}_S$ is the RSS obtained by using subset $S$ of the regressors.

⇒ Exercise 10.
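For small $p$ the minimization can be done by brute force; a self-contained sketch (exhaustive search is exponential in $p$, so this is for illustration only):

```python
from itertools import combinations
import numpy as np

def mdl_bits(X, y):
    # (n/2) log2 RSS_S + (|S|+1)/2 log2 n for the given design matrix
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n / 2 * np.log2(rss) + (p + 1) / 2 * np.log2(n)

def best_subset(X, y):
    """Minimize the MDL criterion over all non-empty regressor subsets."""
    p = X.shape[1]
    subsets = (list(s) for k in range(1, p + 1)
               for s in combinations(range(p), k))
    return min(subsets, key=lambda s: mdl_bits(X[:, s], y))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
print("MDL-selected regressors:", best_subset(X, y))   # expect [0, 2]
```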


Denoising

Complexity = Information + Noise
           = Regularity + Randomness
           = Algorithm + Compressed file

Denoising is the process of removing noise from a signal.

The MDL principle gives a natural method for denoising since the very idea of MDL is to separate the total complexity of a signal into information and noise.

First encode a smooth signal (information), and then the difference to the observed signal (noise).


Wavelet Denoising

One particularly useful way to obtain the regressor (design) matrix is to use wavelets.

[Image by Gabriel Peyré.]
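A small end-to-end sketch under strong simplifying assumptions: a hand-rolled orthonormal Haar transform stands in for the wavelet design matrix, the $k$ largest coefficients are kept, and $k$ is chosen by the regression-style criterion $\frac{n}{2}\log_2 \mathrm{RSS} + \frac{k}{2}\log_2 n$ from above (the exact MDL denoising criterion of the slides may differ).

```python
import numpy as np

def haar(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    c, out, n = np.asarray(x, float).copy(), [], len(x)
    while n > 1:
        a = (c[0:n:2] + c[1:n:2]) / np.sqrt(2)          # averages
        out.append((c[0:n:2] - c[1:n:2]) / np.sqrt(2))  # details
        c[:n // 2] = a
        n //= 2
    return np.concatenate([c[:1]] + out[::-1])          # [scaling, coarse..fine]

def ihaar(w):
    """Inverse of haar()."""
    c, pos = w[:1].copy(), 1
    while len(c) < len(w):
        a, d = c, w[pos:pos + len(c)]
        c = np.empty(2 * len(a))
        c[0::2], c[1::2] = (a + d) / np.sqrt(2), (a - d) / np.sqrt(2)
        pos += len(d)
    return c

def mdl_denoise(y):
    """Keep the k largest Haar coefficients, choosing k by MDL."""
    n, w = len(y), haar(y)
    order = np.argsort(np.abs(w))[::-1]
    best_bits, best_k = np.inf, 1
    for k in range(1, n):
        rss = float(np.sum(w[order[k:]] ** 2))  # orthonormal: RSS in coef space
        bits = n / 2 * np.log2(rss) + k / 2 * np.log2(n)
        if bits < best_bits:
            best_bits, best_k = bits, k
    w_hat = np.zeros_like(w)
    w_hat[order[:best_k]] = w[order[:best_k]]
    return ihaar(w_hat)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 128)
clean = np.where(t < 0.5, 1.0, -0.5)
noisy = clean + rng.normal(scale=0.3, size=t.size)
denoised = mdl_denoise(noisy)
print(f"MSE before {np.mean((noisy - clean)**2):.3f}, "
      f"after {np.mean((denoised - clean)**2):.3f}")
```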


Example: Denoising

[Figure: noisy image (PSNR = 19.8) vs. MDL (A-B) reconstruction (PSNR = 32.9).]


Last Slide

Thanks for listening!
