Source: ...lsong/teaching/CSE6740fall13/lecture0.pdf

Introduction

Machine Learning I CSE 6740, Fall 2013

Le Song


What is Machine Learning (ML)

Study of algorithms that improve their performance at some task with experience


Common to industrial-scale problems


13 million wikipedia pages

800 million users

6 billion photos

24 hours of video uploaded per minute

> 1 trillion webpages


Organizing Images


Image Databases

What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?


Visualize Image Relations


Each image has thousands or millions of pixels.

What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?


Organizing documents


Reading, digesting, and categorizing a vast text database is too much for humans!

We want:

What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?


Weather Prediction


Predict

Numeric values: 40 F Wind: NE at 14 km/h Humidity: 83%

What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?


Face Detection


What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?


Understanding brain activity


What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?


Product Recommendation


What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?


Handwritten digit recognition/text annotation


Inter-character dependency

Inter-word dependency

Aoccdrnig to a sudty at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a ttoal mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?


Similar problem: speech recognition


“Machine Learning is the preferred method for speech recognition …”

Audio signals

Text

Model: Hidden Markov Models


Similar problem: bioinformatics


tgtatcccagcatattttacgtaaaaacaaaacggtaatgcgaacataacttatttattggggcccggaccgcaaaccggccaaacgcgtttgcacccataaaaacataagggcaacaaaaaaattgttaagctgttgtttatttttgcaatcgaaaatc

gaaagcgctcaaatagctgcgatcactcgggagcagggtaaagtcgcctcgaaacaggaagctgaagcatcttctataaatacactcaaagcgatcattccgaggcgagtctggttagaaatttacatggactgcaaaaaggtatagccccacaaactgc

gtctccacatcgctgcgtttcggcagctaattgccttttagaaattattttcccatttcgagaaactcgtgtgggatgccggatgcggctttcaatcacttctggcccgggatcggattgggtcacattgtctgcgggctctattgtctcgatccgcgga

tttaaggcgcagttcgcgtgcttagcggtcagaaaggcagagattcggttcggattgatgcgctggcagcagggcacaaagatctaatgactggcaaatcgctacaaataaattaaagtccggcggctaattaatgagcggactgaagccactttggcgg

agttcattaaccaaaaaacagcagataaacaaaaacggcaaagaaaattgccacagagttgtcacgctttgttgcacaaacatttgtgcagaaaagtgaaaagcttttagccattattaagtttttcctcagctcgctggcagcacttgcgaatgtaaat

tgaaactgatgttcctcataaatgaaaattaatgtttgctctacgctccaccgaactcgcttgtttgggggattggctggctaatcgcggctagatcccaggcggtataaccttttcgcttcatcagttgtgaaaccagatggctggtgttttggcaacg

acagccagcggactcccctcgaacgctctcgaaatcaagtggctttccagccggcccgctgggccgctcgcccactggaccggtattcccaggccaggccacactgtaccgcaccgcataatcctcgccagactcggcgctgataaggcccaatgtcaag

accgcactccgcaggcgtctatttatgccaaggaccgttcttcttcagctttcggctcgagtatttgttgtgccatgttggttacgatgccaatcgcggtacagttatgcaaatgagcagcgaataccgctcactgacaatgaacggcgtcttgtcaacc

attcatattcatgctgacattcatattcattcctttggttttttgtcttcgacggactgaaaagtgcggagagaaacccaaaaacagaagcgcgcaaagcgccgttaatatgcgaactcagcgaactcattgaagttatcacaacaccatatccataacc

atacccatatccatatcaatatcaatatcgctattattaacgatcatgctctgctgatcaagtattcagcgctgcgctagattcgacagattgaatcgagctcaatagactcaacagactccactcgacagatgcgcaatgccaaggacaattgccgtgt

gtaagtggagtaaacgaggcgtatgcgcaacctgcacctggcggacgcggcgtatgcgcaatgtgcaattcgcttaccttctcgttgcgggtcaggaactcccagatgggaatggccgatgacgagctgatctgaatgtggaaggcgcccagcaggcagc

aggccaagattactttcgccgcagtcgtcatggtgtcgttgctgcttttatgttgcgtactccgcactacacggagagttcaggggattcgtgctccgtgatctgtgatccgtgttccgtgggtcaattgcacggttcggttgtgtaaccttcgtgtggc

gactttctttttttttagggcccaataaaagcgcttttgtggcggcttgatagattatcacttggtttcggtggctagccaagtggctttcttctgtccgacgcacttaattgaattaaccaaacaacgagcgtggccaattcgtattatcgctgttttc

gggtgtacgtgtgtctcagcttgaaacgcaaaagcttgtttcacacatcggtttctcggcaagatgggggagtcagtcggtctagggagaggggcgcccaccagtcgatcacgaaaacggcgaattccaagcgaaacggaaacggagcgagcactataat

cttgcagtactatgtcgaacaaccgatcgcggcgatgtcagtgagtcgtcttcggacagcgctggcgctccacacgtatttaagctctgagatcggctttgggagagcgcagagagcgccatcgcacggcagagcgaaagcggcagtgagcgaaagcgag

agggagagcggcagcgggtgggggatcgggagccccccgaaaaaaacagaggcgcacgtcgatgccatcggggaattggaacctcaatgtgtgggaatgtttaaatattctgtgttaggtagtgtagtttcatagactatagattctcatacagattgta

tctgagagtccttcgagccgattatacacgacagcaaaatatttcagtcgcgcttgggcaaaaggcttaagcacgactcccagtccccccttacatttgtcttcctaagcccctggagccactatcaaacttgttctacgcttgcactgaaaatagaaat

taaaaaccaaagtaaacaatcaaaaagaccaaaaacaataacaaccagcaccgagtcgaacatcagtgaggcattgcaaaaatttcaaagtcaagtttgcgtcgtcatcgcgtctgagtccgatcaagccgggcttgtaattgaagttgttgatgagccg

cgggattactggattgtggcgaattctggtcagcatacttaacagcagcccgctaattaagcaaaataaacatatcaaattccagaatgcgacggcgccatcatcctgtttgggaattcaattcgcgggcagatcgtttaattcaattaaaaggtagaag

tcggtaaaagggagcagaagaatgcgatcgctggaatttcctaacatcacggaccccataaatttgataagcccgagctcgctgcgttgagtcagccaccccacatccccaaatccccgccaaaagaagacagctgggttgttgactcgccagattgagg

tgcggattgcagtggagtggacctggtcaaagaagcaccgttaatgtgctgattccattcgattccatccgggaatgcgataaagaaaggctctgatccaagcaactgcaatccggatttcgattttctctttccatttggttttgtatttacgtactta

tgggaaagcattctaatgaagacttggagaagacttacgttatattcagaccatcgtgcgatagaggatgagtcatttccatatggccgaaatttattatgtttactatcgtttttagaggtgttttttggacttaccaaaagaggcatttgttttctgg

gattattcaactgaaaagatatttaaattttttcttggaccattttcaaggttccggatatatttgaaacacactagctagcagtgttggtaagttacatgtatttctataatgtcatattcctttgtccgtattcaaatcgaatactccacatctcaca

ggatgttgtacttgaggaattggcgatcgtagcgatttcccccgccgtaaagttcctgatcctcgttgtttttgtacatcataaagtccggattctgctcgtcgccgaagatgggaacgaagctgccaaagctgagagtctgcttgaggtgctggtccgg

acggcgtcccagctggataaccttgctgtacagatcggcatctgcctggagggcacgatcgaaatccttccagtggacgaacttcacctgctcgctgggaatagcgttgttgtcaagcagctcaaggagcgtattcgagttgacgggctgcaccacgtta

agttcctgctccttcgctggggattcccctgcgggtaagcgccgcttgcttggactcgtttccaaatcccatagccacgccagcagaggagtaacagagctcwhereisthegenetgattaaaaatatcctttaagaaagcccatgggtataacttacg

tgctcactgcgtcctatgcgaggaatggtctttaggttctttatggcaaagttctcgcctcgcttgcccagccgcggtacgttcttggtgatctttaggaagaatcctggactactgtcgtctgcctggcttatggccacaagacccaccaagagcgaaa

cagacaggactgttatgattctcatgctgatgcgactgaagcttcacctgactcctgctccacaattggtggcctttatatagcgagatccacccgcatcttgcgtggaatagaaatgcgggtgactccaggaattagcattatcgatcggaaagtgcaa

agcagataaaactgaactaacctgacctaaatgcctggccataattaagtgcatacatacacattacattacttacatttgtataagaactaaattttatagtacataccacttgcgtatgtaaatgcttgtcttttctcttatatacgttttataaatt


Where is the gene?


Spam Filtering


What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?


Similar problem: webpage classification


Company homepage vs. University homepage


Robot Control


Now cars can find their own way!

What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?


Nonlinear classifier


Nonlinear Decision Boundaries

Linear SVM Decision Boundaries


Nonconventional clusters


Such clusters need more advanced methods, such as kernel methods or spectral clustering.


Syllabus

Covers a number of the most commonly used machine learning algorithms, in sufficient detail to understand their mechanisms.

Organization

Unsupervised learning (data exploration)

Clustering, dimensionality reduction, density estimation, novelty detection

Supervised learning (predictive models)

Classification, regression

Complex models (dealing with nonlinearity, combining models, etc.)

Kernel methods, graphical models, boosting


Prerequisites

Probabilities

Distributions, densities, marginalization, conditioning ….

Basic statistics

Moments, classification, regression, maximum likelihood estimation

Algorithms

Dynamic programming, basic data structures, complexity

Programming

Mostly your choice of language, but Matlab will be very useful

The class will be fast paced

Ability to deal with abstract mathematical concepts


Textbooks

Textbooks:

Pattern Recognition and Machine Learning, Chris Bishop

The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, Jerome Friedman

Machine Learning, Tom Mitchell


Grading

6 assignments (60%)

Approximately 1 assignment every 4 lectures

Start early

Midterm exam (20%)

Final exam (20%)

Project for advanced students

Can be used to replace exams

Based on student experience and lecturer interests.


Homeworks

Zero credit after each deadline

All homeworks must be handed in, even for zero credit

Collaboration

You may discuss the questions

Each student writes their own answers

List on your homework anyone with whom you collaborated

Each student must write their own code for the programming part


Staff

Instructor: Le Song, Klaus 1340

TAs: Joonseok (Klaus 1305) and Nan Du (Klaus 1305)

Guest Lecturer: TBD

Administrative Assistant: Mimi Haley, Klaus 1321

Mailing list: [email protected]

More information:

http://www.cc.gatech.edu/~lsong/teaching/CSE6740fall13.html


Today

Probabilities

Independence

Conditional Independence


Random Variables (RV)

Data may contain many different attributes

Age, grade, color, location, coordinate, time …

Upper-case for random variables (e.g., 𝑋, 𝑌), lower-case for values (e.g., 𝑥, 𝑦)

𝑃(𝑋) for distribution, 𝑝(𝑋) for density

𝑉𝑎𝑙(𝑋) = possible values of random variable 𝑋

For discrete (categorical) variables: ∑_{i=1…|Val(X)|} P(X = x_i) = 1, with P(x) ≥ 0

For continuous variables: ∫_{Val(X)} p(X = x) dx = 1, with p(x) ≥ 0

Shorthand: P(x) for P(X = x)
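These two normalization conditions are easy to sanity-check in code. A minimal sketch (the discrete distribution and the density below are made up for illustration, not from the slides):

```python
# A made-up discrete distribution over Val(X) = {"red", "green", "blue"}:
P = {"red": 0.5, "green": 0.3, "blue": 0.2}

# Discrete case: probabilities are nonnegative and sum to 1.
assert all(p >= 0 for p in P.values())
print(round(sum(P.values()), 10))  # -> 1.0

# Continuous case: the density integrates to 1 over Val(X).
# Midpoint Riemann-sum check for the triangular density p(x) = 2x on [0, 1]:
n = 100_000
integral = sum(2 * (i + 0.5) / n * (1 / n) for i in range(n))
print(round(integral, 6))  # -> 1.0
```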


Interpretations of probability

Frequentists

𝑃(𝑥) is the frequency of 𝑥 in the limit

Many arguments against this interpretation

What is the frequency of the event “it will rain tomorrow?”

Subjective interpretation

𝑃(𝑥) is my degree of belief that 𝑥 will happen

What does “degree of belief” mean?

If 𝑃(𝑥) = 0.8, then I am willing to bet

For this class, we don’t care about the type of interpretation.


Conditional probability

After we have seen 𝑥, how do we feel 𝑦 will happen?

P(y|x) means P(Y = y | X = x)

A conditional distribution is a family of distributions: for each value X = x, it gives a distribution P(Y|x)


Two of the most important rules: I. The chain rule

P(y, x) = P(y|x) P(x)

More generally:

P(x_1, x_2, …, x_k) = P(x_1) P(x_2|x_1) ⋯ P(x_k|x_{k−1}, …, x_2, x_1)
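The chain rule can be checked numerically: any joint distribution assembled by multiplying conditionals this way is automatically valid. A sketch with made-up conditional tables for three binary variables:

```python
from itertools import product

# Made-up conditional tables (purely illustrative, not from the slides).
P_x1 = {0: 0.6, 1: 0.4}
P_x2_given_x1 = {0: {0: 0.7, 1: 0.3},
                 1: {0: 0.2, 1: 0.8}}
P_x3_given_x12 = {(0, 0): {0: 0.5, 1: 0.5}, (0, 1): {0: 0.9, 1: 0.1},
                  (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}

# Chain rule: P(x1, x2, x3) = P(x1) P(x2|x1) P(x3|x1, x2)
joint = {(a, b, c): P_x1[a] * P_x2_given_x1[a][b] * P_x3_given_x12[(a, b)][c]
         for a, b, c in product((0, 1), repeat=3)}

# A joint built this way is automatically normalized:
print(round(sum(joint.values()), 10))  # -> 1.0
```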


Two of the most important rules: II. Bayes rule

P(y|x) = P(x|y) P(y) / P(x) = P(x, y) / ∑_{y′∈Val(Y)} P(x, y′)

(posterior = likelihood × prior / normalization constant)

More generally, with an additional variable z:

P(y|x, z) = P(x|y, z) P(y|z) / P(x|z)
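A quick numeric illustration of Bayes’ rule: the posterior is just the joint cell renormalized by P(x). The joint table below (rain/sun vs. wet/dry ground) is invented for this example:

```python
# Made-up joint distribution P(weather, ground):
P = {("rain", "wet"): 0.30, ("rain", "dry"): 0.10,
     ("sun",  "wet"): 0.05, ("sun",  "dry"): 0.55}

def posterior(weather, ground):
    """P(weather | ground) = P(weather, ground) / P(ground),
    where P(ground) is the normalization constant sum_w P(w, ground)."""
    normalizer = sum(p for (w, g), p in P.items() if g == ground)
    return P[(weather, ground)] / normalizer

print(round(posterior("rain", "wet"), 4))  # -> 0.8571
```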

Page 31: Modeling Rich Structured Data via Kernel Distribution ...lsong/teaching/CSE6740fall13/lecture0.pdfWhat is Machine Learning (ML) Study of algorithms that improve their performance at

Independence

X and Y are independent if P(Y|X) = P(Y); this is written (X ⊥ Y)

Proposition: X and Y are independent if and only if P(X, Y) = P(X) P(Y)
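The factorization test P(X, Y) = P(X) P(Y) can be applied cell by cell to a joint table. A sketch with invented numbers (the first joint is independent by construction; the second is not):

```python
from itertools import product

# Independent case: joint built as a product of made-up marginals.
P_X = {0: 0.3, 1: 0.7}
P_Y = {0: 0.6, 1: 0.4}
joint_indep = {(x, y): P_X[x] * P_Y[y] for x, y in product((0, 1), repeat=2)}

def is_independent(joint, tol=1e-12):
    """Check P(X, Y) = P(X) P(Y) for every cell of a 2x2 joint table."""
    px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
    py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}
    return all(abs(joint[(x, y)] - px[x] * py[y]) < tol for (x, y) in joint)

print(is_independent(joint_indep))                 # -> True
print(is_independent({(0, 0): 0.5, (0, 1): 0.0,
                      (1, 0): 0.0, (1, 1): 0.5}))  # -> False
```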


Conditional independence

Independence is rarely true; conditional independence is more prevalent

X and Y are conditionally independent given Z if P(Y|X, Z) = P(Y|Z); this is written (X ⊥ Y | Z)

(X ⊥ Y | Z) if and only if P(X, Y|Z) = P(X|Z) P(Y|Z)


Joint distribution, marginalization

Two random variables – Grades (G) & Intelligence (I)

P(G, I) is given by the table:

G \ I   VH     H
A       0.7    0.1
B       0.15   0.05

For n binary variables, the table (multiway array) gets really big: P(X_1, X_2, …, X_n) has 2^n entries!

Marginalization – compute the marginal over a single variable:

P(G = B) = P(G = B, I = VH) + P(G = B, I = H) = 0.15 + 0.05 = 0.2
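The marginalization step can be written directly in code. A small sketch using the P(G, I) numbers from the slide:

```python
# The joint table P(G, I) from the slide, G in {A, B}, I in {VH, H}:
P = {("A", "VH"): 0.70, ("A", "H"): 0.10,
     ("B", "VH"): 0.15, ("B", "H"): 0.05}

def marginal_G(g):
    """P(G = g) = sum over i of P(G = g, I = i)."""
    return sum(p for (gg, _), p in P.items() if gg == g)

print(round(marginal_G("B"), 10))  # -> 0.2
print(round(marginal_G("A"), 10))  # -> 0.8
```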


Marginalization – the general case

Compute marginal distribution 𝑃 𝑋𝑖 from 𝑃 𝑋1, 𝑋2, … , 𝑋𝑖 , 𝑋𝑖+1, … , 𝑋𝑛

P(X_1, X_2, …, X_i) = ∑_{x_{i+1}, …, x_n} P(X_1, X_2, …, X_i, x_{i+1}, …, x_n)

P(X_i) = ∑_{x_1, …, x_{i−1}} P(x_1, …, x_{i−1}, X_i)


With binary variables, this requires summing over 2^{n−1} terms!


Example problem

Estimate the probability 𝜃 of landing in heads using a biased coin

Given a sequence of 𝑁 independently and identically distributed (iid) flips

E.g., D = {x_1, x_2, …, x_N} = {1, 0, 1, …, 0}, x_i ∈ {0, 1}

Model (Bernoulli): P(x|θ) = θ^x (1 − θ)^{1−x}, i.e. P(x|θ) = 1 − θ for x = 0, and θ for x = 1

Likelihood of a single observation x_i: P(x_i|θ) = θ^{x_i} (1 − θ)^{1−x_i}


Frequentist Parameter Estimation

Frequentists think of a parameter as a fixed, unknown constant, not a random variable

Hence different “objective” estimators, instead of Bayes’ rule

These estimators have different properties, such as being “unbiased”, “minimum variance”, etc.

A very popular estimator is the maximum likelihood estimator (MLE), which is simple and has good statistical properties

θ̂ = argmax_θ P(D|θ) = argmax_θ ∏_{i=1}^N P(x_i|θ)


MLE for Biased Coin

Objective function, log likelihood

l(θ; D) = log P(D|θ) = log [θ^{n_h} (1 − θ)^{n_t}] = n_h log θ + (N − n_h) log(1 − θ)

We need to maximize this w.r.t. 𝜃

Take derivatives w.r.t. 𝜃

∂l/∂θ = n_h/θ − (N − n_h)/(1 − θ) = 0  ⇒  θ̂_MLE = n_h/N, or equivalently θ̂_MLE = (1/N) ∑_i x_i
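The closed form θ̂_MLE = n_h/N can be verified against a brute-force maximization of the log likelihood. A sketch with a hypothetical dataset of flips:

```python
import math

# Hypothetical dataset of iid flips (1 = heads, 0 = tails):
D = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Closed-form MLE: theta = n_h / N, i.e. the sample mean of the flips.
theta_mle = sum(D) / len(D)
print(theta_mle)  # -> 0.7

def log_likelihood(theta, D):
    """l(theta; D) = n_h log(theta) + (N - n_h) log(1 - theta)."""
    n_h, N = sum(D), len(D)
    return n_h * math.log(theta) + (N - n_h) * math.log(1 - theta)

# A grid search over (0, 1) agrees with the closed form:
grid = [i / 1000 for i in range(1, 1000)]
print(max(grid, key=lambda t: log_likelihood(t, D)))  # -> 0.7
```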


Bayesian Parameter Estimation

Bayesians treat the unknown parameter as a random variable, whose distribution can be inferred using Bayes’ rule:

P(θ|D) = P(D|θ) P(θ) / P(D) = P(D|θ) P(θ) / ∫ P(D|θ) P(θ) dθ

The crucial equation can be written in words:

posterior = (likelihood × prior) / (marginal likelihood)

For iid data, the likelihood is P(D|θ) = ∏_{i=1}^N P(x_i|θ) = ∏_{i=1}^N θ^{x_i} (1 − θ)^{1−x_i} = θ^{∑_i x_i} (1 − θ)^{∑_i (1−x_i)} = θ^{#heads} (1 − θ)^{#tails}

The prior 𝑃 𝜃 encodes our prior knowledge on the domain

Different prior 𝑃 𝜃 will end up with different estimate 𝑃(𝜃|𝐷)!

(Figure: graphical model plate – parameter θ generates each observation X, replicated N times.)

Prior over 𝜃, Beta distribution

P(θ; α, β) = [Γ(α + β) / (Γ(α) Γ(β))] θ^{α−1} (1 − θ)^{β−1}

When x is a positive integer, Γ(x + 1) = x Γ(x) = x!

Posterior distribution of 𝜃

P(θ|x_1, …, x_N) = P(x_1, …, x_N|θ) P(θ) / P(x_1, …, x_N) ∝ θ^{n_h} (1 − θ)^{n_t} · θ^{α−1} (1 − θ)^{β−1} = θ^{n_h+α−1} (1 − θ)^{n_t+β−1}

𝛼 and 𝛽 are hyperparameters and correspond to the number of “virtual” heads and tails (pseudo counts)


Bayesian estimation for biased coin


Bayesian Estimation for Bernoulli

Posterior distribution of θ:

P(θ|x_1, …, x_N) = P(x_1, …, x_N|θ) P(θ) / P(x_1, …, x_N) ∝ θ^{n_h} (1 − θ)^{n_t} · θ^{α−1} (1 − θ)^{β−1} = θ^{n_h+α−1} (1 − θ)^{n_t+β−1}

Posterior mean estimation:

θ̂_Bayes = ∫ θ P(θ|D) dθ = C ∫ θ · θ^{n_h+α−1} (1 − θ)^{n_t+β−1} dθ = (n_h + α) / (N + α + β)

Prior strength: 𝐴 = 𝛼 + 𝛽

A can be interpreted as the size of an imaginary (pseudo-count) dataset
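The posterior-mean formula is a one-liner. A sketch with a hypothetical dataset of flips and made-up Beta(α, β) priors, showing how pseudo counts pull the estimate toward the prior mean:

```python
# Hypothetical dataset of iid flips (1 = heads, 0 = tails):
D = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
n_h, N = sum(D), len(D)

# Posterior mean: theta_bayes = (n_h + alpha) / (N + alpha + beta)
alpha, beta = 2.0, 2.0  # pseudo counts: 2 virtual heads, 2 virtual tails
theta_bayes = (n_h + alpha) / (N + alpha + beta)
print(round(theta_bayes, 4))  # -> 0.6429

# A stronger prior (larger A = alpha + beta) pulls the estimate harder
# toward the prior mean alpha / (alpha + beta) = 0.5:
alpha, beta = 20.0, 20.0
print(round((n_h + alpha) / (N + alpha + beta), 4))  # -> 0.54
```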
