Introduction
Machine Learning I CSE 6740, Fall 2013
Le Song
What is Machine Learning (ML)?
Study of algorithms that improve their performance at some task with experience
Common to industrial-scale problems
13 million Wikipedia pages
800 million users
6 billion photos
24 hours of video uploaded per minute
> 1 trillion webpages
Organizing Images
Image Databases
What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?
Visualize Image Relations
Each image has thousands or millions of pixels.
Organizing documents
Reading, digesting, and categorizing a vast text database is too much for humans!
Weather Prediction
Predict
Numeric values: 40 °F, Wind: NE at 14 km/h, Humidity: 83%
Face Detection
Understanding brain activity
Product Recommendation
Handwritten digit recognition/text annotation
Inter-character dependency
Inter-word dependency
Aoccdrnig to a sudty at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a ttoal mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.
Similar problem: speech recognition
“Machine Learning is the preferred method for speech recognition …”
Audio signals → Text
Model: Hidden Markov Models
Similar problem: bioinformatics
tgtatcccagcatattttacgtaaaaacaaaacggtaatgcgaacataacttatttattggggcccggaccgcaaaccggccaaacgcgtttgcacccataaaaacataagggcaacaaaaaaattgttaagctgttgtttatttttgcaatcgaaaatc
[Two near-identical walls of genomic sequence, roughly 30 lines each, omitted. Buried mid-way through the first copy is the string "whereisthegene": …agcagaggagtaacagagctcwhereisthegenetgattaaaaatatcc…]
Where is the gene?
Spam Filtering
Similar problem: webpage classification
Company homepage vs. University homepage
Robot Control
Now cars can find their own way!
Nonlinear classifier
Nonlinear Decision Boundaries
Linear SVM Decision Boundaries
Nonconventional clusters
Need more advanced methods, such as kernel methods or spectral clustering.
Syllabus
Cover the most commonly used machine learning algorithms, with sufficient detail on their mechanisms.
Organization
Unsupervised learning (data exploration)
Clustering, dimensionality reduction, density estimation, novelty detection
Supervised learning (predictive models)
Classifications, regressions
Complex models (dealing with nonlinearity, combining models, etc.)
Kernel methods, graphical models, boosting
Prerequisites
Probabilities
Distributions, densities, marginalization, conditioning, …
Basic statistics
Moments, classification, regression, maximum likelihood estimation
Algorithms
Dynamic programming, basic data structures, complexity
Programming
Mostly your choice of language, but Matlab will be very useful
The class will be fast-paced
Ability to deal with abstract mathematical concepts
Textbooks
Pattern Recognition and Machine Learning, Chris Bishop
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, Jerome Friedman
Machine Learning, Tom Mitchell
Grading
6 assignments (60%)
Approximately 1 assignment every 4 lectures
Start early
Midterm exam (20%)
Final exam (20%)
Project for advanced students
Can be used to replace exams
Based on student experience and lecturer interests.
Homeworks
Zero credit after each deadline
All homeworks must be handed in, even for zero credit
Collaboration
You may discuss the questions
Each student writes their own answers
List on your homework anyone with whom you collaborated
Each student must write their own code for the programming part
Staff
Instructor: Le Song, Klaus 1340
TAs: Joonseok (Klaus 1305), Nan Du (Klaus 1305)
Guest Lecturer: TBD
Administrative Assistant: Mimi Haley, Klaus 1321
Mailing list: [email protected]
More information:
http://www.cc.gatech.edu/~lsong/teaching/CSE6740fall13.html
Today
Probabilities
Independence
Conditional Independence
Random Variables (RV)
Data may contain many different attributes
Age, grade, color, location, coordinate, time …
Upper-case for random variables (e.g., 𝑋, 𝑌), lower-case for values (e.g., 𝑥, 𝑦)
𝑃(𝑋) for distribution, 𝑝(𝑋) for density
𝑉𝑎𝑙(𝑋) = possible values of random variable 𝑋
For discrete (categorical) variables: ∑_{𝑖=1}^{|𝑉𝑎𝑙(𝑋)|} 𝑃(𝑋 = 𝑥𝑖) = 1
For continuous variables: ∫_{𝑉𝑎𝑙(𝑋)} 𝑝(𝑋 = 𝑥) 𝑑𝑥 = 1
𝑃(𝑥) ≥ 0
Shorthand: 𝑃(𝑥) for 𝑃(𝑋 = 𝑥)
Interpretations of probability
Frequentist interpretation
𝑃(𝑥) is the frequency of 𝑥 in the limit of infinitely many repeated trials
Many arguments against this interpretation
What is the frequency of the event "it will rain tomorrow"?
Subjective interpretation
𝑃(𝑥) is my degree of belief that 𝑥 will happen
What does "degree of belief" mean?
If 𝑃(𝑥) = 0.8, then I should be willing to bet on 𝑥 at the corresponding odds
For this class, we will not worry about which interpretation is used.
Conditional probability
After we have seen 𝑥, how do we feel 𝑦 will happen?
𝑃(𝑦|𝑥) means 𝑃(𝑌 = 𝑦 | 𝑋 = 𝑥)
A conditional distribution is a family of distributions: for each value 𝑋 = 𝑥, 𝑃(𝑌|𝑥) is a distribution over 𝑌
Two of the most important rules: I. The chain rule
𝑃(𝑦, 𝑥) = 𝑃(𝑦|𝑥) 𝑃(𝑥)
More generally:
𝑃(𝑥1, 𝑥2, …, 𝑥𝑘) = 𝑃(𝑥1) 𝑃(𝑥2|𝑥1) ⋯ 𝑃(𝑥𝑘|𝑥𝑘−1, …, 𝑥2, 𝑥1)
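The chain rule can be checked numerically on a small joint table; here is a minimal Python sketch (the joint distribution below is made up for illustration):

```python
# Hypothetical joint distribution P(x, y) over two binary variables.
P_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginal P(x), obtained by summing out y.
P_x = {x: sum(p for (xx, _), p in P_xy.items() if xx == x) for x in (0, 1)}

# Conditional P(y|x) from the definition P(y|x) = P(x, y) / P(x).
P_y_given_x = {(y, x): P_xy[(x, y)] / P_x[x] for x in (0, 1) for y in (0, 1)}

# Chain rule: P(y, x) = P(y|x) P(x) recovers every entry of the joint.
for (x, y), p in P_xy.items():
    assert abs(P_y_given_x[(y, x)] * P_x[x] - p) < 1e-12
```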
Two of the most important rules: II. Bayes rule
𝑃(𝑦|𝑥) = 𝑃(𝑥|𝑦) 𝑃(𝑦) / 𝑃(𝑥) = 𝑃(𝑥, 𝑦) / ∑_{𝑦′ ∈ 𝑉𝑎𝑙(𝑌)} 𝑃(𝑥, 𝑦′)
posterior = likelihood × prior / normalization constant
More generally, with an additional variable 𝑧:
𝑃(𝑦|𝑥, 𝑧) = 𝑃(𝑥|𝑦, 𝑧) 𝑃(𝑦|𝑧) / 𝑃(𝑥|𝑧)
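To make Bayes rule concrete, here is a small numerical sketch in Python (the event and detector probabilities are hypothetical):

```python
# Hypothetical setup: y = 1 if a rare event occurs, x = 1 if a noisy detector fires.
P_y = {1: 0.01, 0: 0.99}           # prior P(y)
P_x1_given_y = {1: 0.95, 0: 0.05}  # likelihood P(x = 1 | y)

# Normalization constant: P(x = 1) = sum over y of P(x = 1 | y) P(y).
P_x1 = sum(P_x1_given_y[y] * P_y[y] for y in (0, 1))

# Bayes rule: posterior = likelihood * prior / normalization constant.
posterior = P_x1_given_y[1] * P_y[1] / P_x1
# Even with a fairly accurate detector, the posterior stays small,
# because the prior P(y = 1) is small.
```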
Independence
𝑋 and 𝑌 are independent if 𝑃(𝑌|𝑋) = 𝑃(𝑌); notation: (𝑋 ⊥ 𝑌)
Proposition: 𝑋 and 𝑌 are independent if and only if 𝑃(𝑋, 𝑌) = 𝑃(𝑋) 𝑃(𝑌)
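The factorization criterion can be tested directly on a finite joint table; a Python sketch with made-up numbers:

```python
def is_independent(P_xy, tol=1e-12):
    """Check P(X, Y) = P(X) P(Y) entrywise for a finite discrete joint."""
    xs = {x for x, _ in P_xy}
    ys = {y for _, y in P_xy}
    P_x = {x: sum(P_xy[(x, y)] for y in ys) for x in xs}
    P_y = {y: sum(P_xy[(x, y)] for x in xs) for y in ys}
    return all(abs(P_xy[(x, y)] - P_x[x] * P_y[y]) <= tol
               for x in xs for y in ys)

# Independent: every entry is the product of marginals (0.6, 0.4) and (0.3, 0.7).
assert is_independent({(0, 0): 0.18, (0, 1): 0.42, (1, 0): 0.12, (1, 1): 0.28})
# Dependent: X = Y with probability 1, so the joint does not factorize.
assert not is_independent({(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5})
```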
Conditional independence
Independence is rarely true; conditional independence is more prevalent
𝑋 and 𝑌 are conditionally independent given 𝑍 if 𝑃(𝑌|𝑋, 𝑍) = 𝑃(𝑌|𝑍); notation: (𝑋 ⊥ 𝑌 | 𝑍)
(𝑋 ⊥ 𝑌 | 𝑍) if and only if 𝑃(𝑋, 𝑌|𝑍) = 𝑃(𝑋|𝑍) 𝑃(𝑌|𝑍)
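A standard illustration, sketched in Python (the two-coin mixture is a made-up example): 𝑋 and 𝑌 are two flips of the same randomly chosen coin, so they are dependent marginally but independent given the coin's identity 𝑍.

```python
import itertools

P_z = {0: 0.5, 1: 0.5}       # which coin was picked
P_heads = {0: 0.9, 1: 0.1}   # bias of each coin

# Build the joint P(x, y, z): given z, the two flips are independent.
P_xyz = {}
for x, y, z in itertools.product((0, 1), (0, 1), (0, 1)):
    px = P_heads[z] if x == 1 else 1 - P_heads[z]
    py = P_heads[z] if y == 1 else 1 - P_heads[z]
    P_xyz[(x, y, z)] = P_z[z] * px * py

# Verify P(x, y | z) = P(x | z) P(y | z) for every x, y, z.
for z in (0, 1):
    Pz = sum(v for (_, _, zz), v in P_xyz.items() if zz == z)
    for x, y in itertools.product((0, 1), (0, 1)):
        pxy = P_xyz[(x, y, z)] / Pz
        px = sum(P_xyz[(x, yy, z)] for yy in (0, 1)) / Pz
        py = sum(P_xyz[(xx, y, z)] for xx in (0, 1)) / Pz
        assert abs(pxy - px * py) < 1e-12
```

Marginally, however, 𝑃(𝑋 = 1, 𝑌 = 1) = 0.41 while 𝑃(𝑋 = 1)𝑃(𝑌 = 1) = 0.25, so 𝑋 and 𝑌 are not independent without conditioning on 𝑍.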
Joint distribution, marginalization
Two random variables: Grades (𝐺) and Intelligence (𝐼)

𝑃(𝐺, 𝐼):      𝐼 = VH    𝐼 = H
  𝐺 = A        0.7       0.1
  𝐺 = B        0.15      0.05

For 𝑛 binary variables, the table (multiway array) gets really big:
𝑃(𝑋1, 𝑋2, …, 𝑋𝑛) has 2^𝑛 entries!
Marginalization: compute the marginal over a single variable, e.g.
𝑃(𝐺 = B) = 𝑃(𝐺 = B, 𝐼 = VH) + 𝑃(𝐺 = B, 𝐼 = H) = 0.15 + 0.05 = 0.2
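The marginalization over 𝐼 can be reproduced with the slide's own numbers (a minimal Python sketch):

```python
# The joint table from the slide: G in {A, B}, I in {VH, H}.
P_GI = {("A", "VH"): 0.7, ("A", "H"): 0.1,
        ("B", "VH"): 0.15, ("B", "H"): 0.05}

# Marginal over G: sum out I.
P_G = {g: sum(p for (gg, _), p in P_GI.items() if gg == g) for g in ("A", "B")}
# P(G = B) = 0.15 + 0.05 = 0.2
```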
Marginalization – the general case
Compute the marginal distribution 𝑃(𝑋𝑖) from 𝑃(𝑋1, 𝑋2, …, 𝑋𝑖, 𝑋𝑖+1, …, 𝑋𝑛):
𝑃(𝑋1, 𝑋2, …, 𝑋𝑖) = ∑_{𝑥𝑖+1, …, 𝑥𝑛} 𝑃(𝑋1, 𝑋2, …, 𝑋𝑖, 𝑥𝑖+1, …, 𝑥𝑛)
𝑃(𝑋𝑖) = ∑_{𝑥1, …, 𝑥𝑖−1} 𝑃(𝑥1, …, 𝑥𝑖−1, 𝑋𝑖)
For binary variables, this requires summing over 2^{𝑛−1} terms!
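For instance, with 𝑛 = 4 binary variables, marginalizing down to 𝑋1 sums 2^{𝑛−1} = 8 terms per value; a Python sketch with an arbitrary normalized joint:

```python
import itertools

n = 4
# Arbitrary positive weights over all 2^n assignments, normalized into a joint.
raw = {xs: i + 1 for i, xs in enumerate(itertools.product((0, 1), repeat=n))}
Z = sum(raw.values())
P = {xs: v / Z for xs, v in raw.items()}

# Marginal P(X1 = x1): sum over the remaining n - 1 variables.
P_X1 = {x1: sum(P[(x1,) + rest]
                for rest in itertools.product((0, 1), repeat=n - 1))
        for x1 in (0, 1)}

# The marginal is itself a distribution.
assert abs(sum(P_X1.values()) - 1.0) < 1e-12
```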
Example problem
Estimate the probability 𝜃 of landing heads using a biased coin
Given a sequence of 𝑁 independently and identically distributed (iid) flips
E.g., 𝐷 = {𝑥1, 𝑥2, …, 𝑥𝑁} = {1, 0, 1, …, 0}, 𝑥𝑖 ∈ {0, 1}
Model: 𝑃(𝑥|𝜃) = 𝜃^𝑥 (1 − 𝜃)^{1−𝑥}, i.e. 𝑃(𝑥|𝜃) = 1 − 𝜃 for 𝑥 = 0, and 𝜃 for 𝑥 = 1
Likelihood of a single observation 𝑥𝑖: 𝑃(𝑥𝑖|𝜃) = 𝜃^{𝑥𝑖} (1 − 𝜃)^{1−𝑥𝑖}
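The model and its likelihood translate directly into code (a Python sketch; the short flip sequence is made up):

```python
def bernoulli_likelihood(x, theta):
    """P(x | theta) = theta^x (1 - theta)^(1 - x), for x in {0, 1}."""
    return theta ** x * (1 - theta) ** (1 - x)

def sequence_likelihood(D, theta):
    """Likelihood of an iid sequence: the product of per-flip likelihoods."""
    out = 1.0
    for x in D:
        out *= bernoulli_likelihood(x, theta)
    return out

D = [1, 0, 1, 1, 0]   # a short example sequence of flips
# sequence_likelihood(D, 0.6) multiplies 0.6 for each head and 0.4 for each tail
```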
Frequentist Parameter Estimation
Frequentists think of a parameter as a fixed, unknown constant, not a random variable
Hence different “objective” estimators, instead of Bayes’ rule
These estimators have different properties, such as being “unbiased”, “minimum variance”, etc.
A very popular estimator is the maximum likelihood estimator (MLE), which is simple and has good statistical properties
𝜃̂ = argmax_𝜃 𝑃(𝐷|𝜃) = argmax_𝜃 ∏_{𝑖=1}^{𝑁} 𝑃(𝑥𝑖|𝜃)
MLE for Biased Coin
Objective function: the log likelihood
𝑙(𝜃; 𝐷) = log 𝑃(𝐷|𝜃) = log [𝜃^{𝑛ℎ} (1 − 𝜃)^{𝑛𝑡}] = 𝑛ℎ log 𝜃 + (𝑁 − 𝑛ℎ) log(1 − 𝜃)
We need to maximize this w.r.t. 𝜃. Taking the derivative w.r.t. 𝜃 and setting it to zero:
∂𝑙/∂𝜃 = 𝑛ℎ/𝜃 − (𝑁 − 𝑛ℎ)/(1 − 𝜃) = 0  ⇒  𝜃̂_MLE = 𝑛ℎ/𝑁, or equivalently 𝜃̂_MLE = (1/𝑁) ∑𝑖 𝑥𝑖
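The closed-form MLE can be sanity-checked against a brute-force grid search over 𝜃 (a Python sketch; the data are made up):

```python
import math

def log_likelihood(theta, n_h, N):
    """l(theta; D) = n_h log(theta) + (N - n_h) log(1 - theta)."""
    return n_h * math.log(theta) + (N - n_h) * math.log(1 - theta)

def theta_mle(D):
    """Closed-form MLE for the Bernoulli: fraction of heads, n_h / N."""
    return sum(D) / len(D)

D = [1, 0, 1, 1, 0, 1]   # n_h = 4, N = 6
n_h, N = sum(D), len(D)

# The grid maximum of the log likelihood lands on (approximately) n_h / N.
grid = [k / 1000 for k in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, n_h, N))
assert abs(best - theta_mle(D)) < 1e-3
```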
Bayesian Parameter Estimation
Bayesians treat the unknown parameter as a random variable, whose distribution can be inferred using Bayes' rule:
𝑃(𝜃|𝐷) = 𝑃(𝐷|𝜃) 𝑃(𝜃) / 𝑃(𝐷) = 𝑃(𝐷|𝜃) 𝑃(𝜃) / ∫ 𝑃(𝐷|𝜃) 𝑃(𝜃) 𝑑𝜃
In words: posterior = likelihood × prior / marginal likelihood
For iid data, the likelihood is
𝑃(𝐷|𝜃) = ∏_{𝑖=1}^{𝑁} 𝑃(𝑥𝑖|𝜃) = ∏_{𝑖=1}^{𝑁} 𝜃^{𝑥𝑖} (1 − 𝜃)^{1−𝑥𝑖} = 𝜃^{∑𝑖 𝑥𝑖} (1 − 𝜃)^{∑𝑖 (1−𝑥𝑖)} = 𝜃^{#head} (1 − 𝜃)^{#tail}
The prior 𝑃(𝜃) encodes our prior knowledge of the domain
Different priors 𝑃(𝜃) lead to different posterior estimates 𝑃(𝜃|𝐷)!
[Graphical model: parameter 𝜃 generating observations 𝑋, replicated 𝑁 times (plate notation)]
Prior over 𝜃: the Beta distribution
𝑃(𝜃; 𝛼, 𝛽) = [Γ(𝛼 + 𝛽) / (Γ(𝛼) Γ(𝛽))] 𝜃^{𝛼−1} (1 − 𝜃)^{𝛽−1}
(For integer 𝑥, Γ(𝑥 + 1) = 𝑥 Γ(𝑥) = 𝑥!)
Posterior distribution of 𝜃:
𝑃(𝜃|𝑥1, …, 𝑥𝑁) = 𝑃(𝑥1, …, 𝑥𝑁|𝜃) 𝑃(𝜃) / 𝑃(𝑥1, …, 𝑥𝑁) ∝ 𝜃^{𝑛ℎ} (1 − 𝜃)^{𝑛𝑡} ⋅ 𝜃^{𝛼−1} (1 − 𝜃)^{𝛽−1} = 𝜃^{𝑛ℎ+𝛼−1} (1 − 𝜃)^{𝑛𝑡+𝛽−1}
𝛼 and 𝛽 are hyperparameters and correspond to the number of "virtual" heads and tails (pseudo counts)
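Conjugacy makes the posterior update a matter of adding observed counts to the pseudo counts (a Python sketch; the data and prior values are made up):

```python
def posterior_params(D, alpha, beta):
    """Beta(alpha, beta) prior + Bernoulli data -> Beta(alpha + n_h, beta + n_t)."""
    n_h = sum(D)
    n_t = len(D) - n_h
    return alpha + n_h, beta + n_t

# Three heads and two tails on top of a Beta(2, 2) prior:
a, b = posterior_params([1, 0, 1, 1, 0], alpha=2, beta=2)
# The posterior is Beta(2 + 3, 2 + 2) = Beta(5, 4).
```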
Bayesian estimation for the biased coin (Bernoulli)
Posterior distribution of 𝜃:
𝑃(𝜃|𝑥1, …, 𝑥𝑁) = 𝑃(𝑥1, …, 𝑥𝑁|𝜃) 𝑃(𝜃) / 𝑃(𝑥1, …, 𝑥𝑁) ∝ 𝜃^{𝑛ℎ+𝛼−1} (1 − 𝜃)^{𝑛𝑡+𝛽−1}
Posterior mean estimate:
𝜃̂_Bayes = ∫ 𝜃 𝑃(𝜃|𝐷) 𝑑𝜃 = 𝐶 ∫ 𝜃 ⋅ 𝜃^{𝑛ℎ+𝛼−1} (1 − 𝜃)^{𝑛𝑡+𝛽−1} 𝑑𝜃 = (𝑛ℎ + 𝛼) / (𝑁 + 𝛼 + 𝛽)
Prior strength: 𝐴 = 𝛼 + 𝛽
𝐴 can be interpreted as the size of an imaginary dataset
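The effect of the prior strength is easy to see by comparing the posterior mean with the MLE on a small sample (a Python sketch; the sample and prior are made up):

```python
def theta_mle(D):
    """MLE: fraction of heads."""
    return sum(D) / len(D)

def theta_bayes(D, alpha, beta):
    """Posterior mean: (n_h + alpha) / (N + alpha + beta)."""
    n_h, N = sum(D), len(D)
    return (n_h + alpha) / (N + alpha + beta)

D = [1, 1, 1]   # three heads in a row
# The MLE jumps to theta = 1, while a Beta(1, 1) prior (A = 2 imaginary flips)
# pulls the estimate back toward 1/2: (3 + 1) / (3 + 1 + 1) = 0.8.
```

As 𝑁 grows, the pseudo counts 𝛼 and 𝛽 are swamped by the data and the two estimates converge.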