
Bioinformatics II
Theoretical Bioinformatics and Machine Learning

Part 1

Sepp Hochreiter
Institute of Bioinformatics
Johannes Kepler University, Linz, Austria


Course

6 ECTS 4 SWS VO (class)

3 ECTS 2 SWS UE (exercise)

Basic Course of Master Bioinformatics

Class: Mo 15:30-17:00 (HS14) and Thu 15:30-17:00 (T111)

Exercise: Wed 13:45-15:15 (KG712)

VO: final exam
UE: weekly homework (evaluated)

Other Courses of the Master's in Bioinformatics:
BI III: Tue 15:30-17:00 (T211)
BI IV: Fri 8:30-11:45 (2-weekly, beginning 07.03; KG712)
Exercise: Thu 8:30-10:00 (2-weekly, beginning 13.03; KG712)


Outline

1 Introduction
2 Basics of Machine Learning
3 Theoretical Background of Machine Learning
4 Support Vector Machines
5 Error Minimization and Model Selection
6 Neural Networks
7 Bayes Techniques
8 Feature Selection
9 Hidden Markov Models
10 Unsupervised Learning: Projection Methods and Clustering
11 Model Selection
12 Non-parametric Methods: Decision Trees and k-Nearest Neighbors
13 Graphical Models / Belief Networks / Bayes Networks


1 Introduction

2 Basics of Machine Learning
2.1 Machine Learning in Bioinformatics
2.2 Introductory Example
2.3 Supervised and Unsupervised Learning
2.4 Reinforcement Learning
2.5 Feature Extraction, Selection, and Construction
2.6 Parametric vs. Non-Parametric Models
2.7 Generative vs. Descriptive Models
2.8 Prior and Domain Knowledge
2.9 Model Selection and Training
2.10 Model Evaluation, Hyperparameter Selection, and Final Model


3 Theoretical Background of Machine Learning
3.1 Model Quality Criteria
3.2 Generalization Error
3.3 Minimal Risk for a Gaussian Classification Task
3.4 Maximum Likelihood
3.5 Noise Models
3.6 Statistical Learning Theory


4 Support Vector Machines
4.1 Support Vector Machines in Bioinformatics
4.2 Linear Separable Problems
4.3 Linear SVM
4.4 Linear SVM for Non-Linear Separable Problems
4.5 Average Error Bounds for SVMs
4.6 nu-SVM
4.7 Non-Linear SVM and the Kernel Trick
4.8 Example: Face Recognition
4.9 Multiclass SVM
4.10 Support Vector Regression
4.11 One Class SVM
4.12 Least Square SVM
4.13 Potential Support Vector Machine
4.14 SVM Optimization and SMO
4.15 Designing Kernels for Bioinformatic Applications
4.16 Kernel Principal Component Analysis
4.17 Kernel Discriminant Analysis
4.18 Software


5 Error Minimization and Model Selection
5.1 Search Methods and Evolutionary Approaches
5.2 Gradient Descent
5.3 Step-size Optimization
5.4 Optimization of the Update Direction
5.5 Levenberg-Marquardt Algorithm
5.6 Predictor-Corrector Methods for R(w) = 0
5.7 Convergence Properties
5.8 On-line Optimization


6 Neural Networks
6.1 Neural Networks in Bioinformatics
6.2 Motivation of Neural Networks
6.3 Linear Neurons and Perceptron
6.4 Multi Layer Perceptron
6.5 Radial Basis Function Networks
6.6 Recurrent Neural Networks


7 Bayes Techniques
7.1 Likelihood, Prior, Posterior, Evidence
7.2 Maximum A Posteriori Approach
7.3 Posterior Approximation
7.4 Error Bars and Confidence Intervals
7.5 Hyper-parameter Selection: Evidence Framework
7.6 Hyper-parameter Selection: Integrate Out
7.7 Model Comparison
7.8 Posterior Sampling


8 Feature Selection
8.1 Feature Selection in Bioinformatics
8.2 Feature Selection Methods
8.3 Microarray Gene Selection Protocol

9 Hidden Markov Models
9.1 Hidden Markov Models in Bioinformatics
9.2 Hidden Markov Model Basics
9.3 Expectation Maximization for HMM: Baum-Welch Algorithm
9.4 Viterbi Algorithm
9.5 Input Output Hidden Markov Models
9.6 Factorial Hidden Markov Models
9.7 Memory Input Output Factorial Hidden Markov Models
9.8 Tricks of the Trade
9.9 Profile Hidden Markov Models


10 Unsupervised Learning: Projection Methods and Clustering
10.1 Introduction
10.2 Principal Component Analysis
10.3 Independent Component Analysis
10.4 Factor Analysis
10.5 Projection Pursuit and Multidimensional Scaling
10.6 Clustering


Literature

• ML: Duda, Hart, Stork; Pattern Classification; Wiley & Sons, 2001

• NN: C. M. Bishop; Neural Networks for Pattern Recognition; Oxford Univ. Press, 1995

• SVM: Schölkopf, Smola; Learning with Kernels; MIT Press, 2002

• SVM: V. N. Vapnik; Statistical Learning Theory; Wiley & Sons, 1998

• Statistics: S. M. Kay; Fundamentals of Statistical Signal Processing; Prentice Hall, 1993

• Bayes Nets: M. I. Jordan; Learning in Graphical Models; MIT Press, 1998

• ML: T. M. Mitchell; Machine Learning; McGraw-Hill, 1997

• NN: R. M. Neal; Bayesian Learning for Neural Networks; Springer, 1996

• Feature Selection: Guyon, Gunn, Nikravesh, Zadeh; Feature Extraction: Foundations and Applications; Springer, 2006

• BI: Schölkopf, Tsuda, Vert; Kernel Methods in Computational Biology; MIT Press, 2003


Chapter 1

Introduction


• part of curriculum “master of science in bioinformatics”

• many fields in bioinformatics are based on machine learning

- secondary and 3D structure prediction of proteins

- microarrays: data preprocessing, gene selection, prediction

- DNA data: alternative splicing, nucleosome positions, gene regulation

• methods: neural networks, support vector machines, kernel approaches, projection methods, belief networks

• goals: noise reduction, feature selection, structure extraction, classification / regression, modeling


• Examples:
  - cancer treatment outcomes / microarrays
  - classification of novel protein sequences into structural or functional classes
  - dependencies between DNA markers (SNPs, single nucleotide polymorphisms) and diseases (schizophrenia or alcohol dependence)

• only the most prominent machine learning techniques

• not many mathematical or practical details

• only few selected applications in biology and medicine

• Goals:
  - how to choose appropriate methods from a given pool
  - understand and evaluate the different approaches
  - where to obtain and how to use them
  - adapt and modify standard algorithms


Chapter 2

Basics of Machine Learning


• deductive: the programmer must understand the problem, find a solution, and implement it

• inductive: the solution to a problem is found by a machine that learns

• inductive is data driven: biology, chemistry, biophysics, medicine, and other fields in the life sciences possess a huge amount of data

• learning: automatically finds structures in the data

• algorithms that automatically improve a solution with more data


Machine Learning:

• classification
• prediction
• structuring (clustering)
• compression
• visualization
• filtering
• selecting relevant components
• extracting dependencies
• modeling the data generating system
• constructing noise models
• integrating


Machine Learning in Bioinformatics


• secondary structure prediction (neural nets, support vector machines)
• gene recognition (hidden Markov models)
• multiple alignment (hidden Markov models, clustering)
• splice site recognition (neural networks)
• microarray data: normalization (factor analysis)
• microarray data: gene selection (feature selection)
• microarray data: prediction (neural nets, support vector machines)
• microarray data: dependencies (independent component analysis, clustering)
• protein structure and function classification (support vector machines, recurrent networks)
• alternative splice site recognition (SVMs, recurrent nets)
• prediction of nucleosome positions
• single nucleotide polymorphisms (SNPs): new approaches
• peptide and protein arrays: new approaches
• systems biology and modeling: new approaches


Introductory Example


Example from "Pattern Classification", Duda, Hart, and Stork, 2001, John Wiley & Sons, Inc.

• salmon must be distinguished from sea bass, given images

• automated system to separate the fish in a fish-packing company

• Given: a set of pictures with known fish, the training set

• Goal: in the future, automatically separate images of salmon from images of sea bass, that is, generalization



• First step: preprocessing and feature extraction

• Preprocessing: contrast / brightness correction, segmentation, alignment

• Features: length of the fish, lightness

• Length:

optimal decision boundary: minimal misclassifications
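The length histograms of the slide are not reproduced here, but the idea can be illustrated in code. A minimal numpy sketch (with made-up length values and labels, not the data from the figure): try every candidate threshold on the single feature and keep the one with the fewest training misclassifications.

    import numpy as np

    # hypothetical training data: length of each fish and its label
    # (+1 = sea bass, -1 = salmon); values are made up for illustration
    length = np.array([4.1, 5.0, 6.2, 7.5, 8.3, 9.0, 10.2, 11.5])
    label  = np.array([ -1,  -1,  -1,  +1,  -1,  +1,   +1,   +1])

    # candidate thresholds: midpoints between sorted feature values
    order = np.argsort(length)
    mids = (length[order][:-1] + length[order][1:]) / 2.0

    best_thr, best_err = None, len(label) + 1
    for thr in mids:
        pred = np.where(length > thr, +1, -1)   # classify by the threshold
        err = np.sum(pred != label)             # training misclassifications
        if err < best_err:
            best_thr, best_err = thr, err

    print(best_thr, best_err)   # threshold with the fewest training errors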


• Lightness:

Different features may be differently suited for the problem. Misclassifications are weighted equally (otherwise a new optimal boundary results).


• Width of the fishes:

width may only be suited in combination with other features. Hypothesis: lightness changes with age, width indicates age.


• optimal lightness is a nonlinear function of the width, that is, the optimal boundary is a nonlinear curve

• for a new fish at "?" we would guess salmon, but the system fails: low generalization, one outlier sea bass changed the curve


• one sea bass has lightness and width typical for salmon

• a complex boundary curve also catches this outlier and assigns the surrounding space to sea bass

• future examples in this region will be wrongly classified

decision boundary with high generalization


• we selected the features which are best suited

• bioinformatics applications: number of features is large

• selecting the best features by visual inspection is impossible

• e.g. genes indicative of a certain cancer type must be chosen from 30,000 human genes

• feature selection is important: machine selects the features

• construct new features from the old ones: feature construction

• question of cost: how expensive is a certain error

• measurement noise: how noisy are the features

• classification noise: what errors of human labeling are to be expected

• a first example of a too complex model overspecialized to the training data


Supervised and Unsupervised Learning


• in our fish example an expert characterized the data by labeling them

• supervised learning : desired output (target) for each object is given

• unsupervised learning : no desired output per object

• supervised: error value on each object; classification / regression / time series analysis
  fish example: classification salmon vs. sea bass; regression: predict the age of the fish; time series prediction: growth from the past

• unsupervised:
  - cumulative error over all objects (entropy, statistical independence, information content, etc.)
  - probability of the model producing the data: likelihood
  - principal component analysis (PCA), independent component analysis (ICA), factor analysis, projection pursuit, clustering (k-means), mixture models, density estimation, hidden Markov models, belief networks


• projection: representation of objects by down-projected feature vectors; PCA: orthogonal components of maximal data variation; ICA: statistically mutually independent components; factor analysis: PCA with noise

• density estimation: density model of the observed data

• clustering: extract clusters – regions of data accumulation (typical data)

• clustering and (down-)projection: feature construction, compact representation of the data, non-redundant, noise removal
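As a concrete illustration of down-projection, a minimal numpy sketch of PCA: center the data, take the eigenvectors of the covariance matrix with the largest eigenvalues (the orthogonal directions of maximal data variation), and project onto them. The data below are random placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))          # 100 objects, 5 features (placeholder data)

    Xc = X - X.mean(axis=0)                # center the data
    C = np.cov(Xc, rowvar=False)           # covariance matrix of the features
    eigval, eigvec = np.linalg.eigh(C)     # eigendecomposition, ascending eigenvalues

    k = 2                                  # number of principal components to keep
    W = eigvec[:, ::-1][:, :k]             # top-k eigenvectors = principal directions
    Z = Xc @ W                             # down-projected representation (100 x 2)
    print(Z.shape)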



Isomap: method for down-projecting data



ICA: on images


ICA: on video components


Reinforcement Learning


Not considered because not relevant for bioinformatics:

• reinforcement learning: the model produces an output sequence; a reward or penalty is given at the end of the sequence or during the sequence (no target output)

• neither supervised nor unsupervised learning

• model: policy

• learning: world model or value function

• two learning techniques : direct policy optimization vs. policy / value iteration (world model)

• exploitation / exploration trade-off: better to learn or to gain reward

• methods: Q-learning, SARSA, Temporal Difference (TD), Monte Carlo estimation


Feature Extraction, Selection, and Construction


• our salmon vs. sea bass example: features must be extracted

• fMRI brain images and EEG measurements



Feature Selection:

• features are directly measured

• huge number of features: microarrays with 30,000 genes

• other measurements with many features: peptide arrays, protein arrays, mass spectrometry, SNPs

• many features not related to the task (genes relevant for cancer)



• features without target correlation may be helpful

• feature with highest target correlation may be a suboptimal selection

     f1  f2   t          f1  f2  f3   t
     -2   3   1           0  -1   0  -1
      2  -3  -1           1   1   0   1
     -2   1  -1          -1   0  -1  -1
      2  -1   1           1   0   1   1

Table 1: Left hand side: the target t is computed from two features f1 and f2 as t = f1 + f2. There is no correlation between t and f1. Right hand side: t = f2 + f3. f1, the feature which has the highest correlation coefficient with the target (0.9 compared to 0.71 for the other features), should not be selected.
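The correlation coefficients quoted in Table 1 can be verified directly; a short numpy sketch for the right-hand side of the table:

    import numpy as np

    # right-hand side of Table 1: t = f2 + f3
    f1 = np.array([ 0,  1, -1,  1])
    f2 = np.array([-1,  1,  0,  0])
    f3 = np.array([ 0,  0, -1,  1])
    t  = f2 + f3                                 # = [-1, 1, -1, 1]

    for name, f in [("f1", f1), ("f2", f2), ("f3", f3)]:
        r = np.corrcoef(f, t)[0, 1]
        print(name, round(r, 2))                 # f1: 0.9, f2: 0.71, f3: 0.71

Selecting only f1, the single most correlated feature, cannot reproduce t, whereas f2 and f3 together determine it exactly.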


Feature Construction:

• combine features into a new feature
  - PCA or ICA
  - averaging out

• kernel methods map into another space where new features are used

• example: a sequence of amino acids may be represented by
  - an occurrence vector
  - certain motifs
  - its similarity to other sequences
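A minimal sketch of the first representation above, the occurrence vector of an amino acid sequence (the example sequence is made up):

    from collections import Counter

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"        # the 20 standard amino acids

    def occurrence_vector(sequence):
        """Count vector over the 20 amino acids (a simple constructed feature)."""
        counts = Counter(sequence)
        return [counts.get(aa, 0) for aa in AMINO_ACIDS]

    print(occurrence_vector("MKVLAAGLLK"))      # made-up example sequence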


Parametric vs. Non-Parametric Models


• an important step in machine learning is to select a model class

• parametric models: each parameter vector represents a model
  - neural networks, where the parameters are the synaptic weights
  - support vector machines

• learning: paths through the parameter space

• disadvantages:
  - different parameterizations of the same function
  - model complexity and model class are expressed via the parameters

• nonparametric models: the model is locally constant or a superposition of such constant models
  - k-nearest-neighbor (k is a hyperparameter, not adjusted during learning; sketch below)
  - kernel density estimation
  - decision trees

• the constant models (rules) must be selected a priori, that is, the hyperparameters must be fixed (k, kernel width, splitting rules)
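A minimal numpy sketch of the k-nearest-neighbor classifier mentioned above: the "model" is the stored training data plus the fixed hyperparameter k, and prediction is a majority vote over the k closest training points (Euclidean distance; the tiny data set is made up).

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3):
        """Classify x by majority vote of its k nearest training points."""
        dist = np.linalg.norm(X_train - x, axis=1)     # Euclidean distances
        nearest = np.argsort(dist)[:k]                 # indices of the k closest points
        votes = y_train[nearest]
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]

    # tiny made-up example with labels +1 / -1
    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
    y_train = np.array([-1, -1, +1, +1])
    print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))   # -> 1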


Generative vs. Descriptive Models


• descriptive model: additional description or another representation of the data

• projection methods (PCA, ICA)

• generative models: model should produce the distribution observed for the real world data points

• describing or representing random components which drive the process

• prior knowledge about the world or desired model

• predict new states of the data generation process (brain, cell)


Prior and Domain Knowledge


• reasonable distance measures for k-nearest-neighbor

• construct problem-relevant features

• extract appropriate features from the raw data

• bioinformatics: distances based on alignment
  - string kernel
  - Smith-Waterman kernel
  - local alignment kernel
  - motif kernel
  (a simple string-kernel sketch follows below)

• bioinformatics: secondary structure prediction with recurrent networks; the 3.7 amino acid period of a helix in the input

• bioinformatics: knowledge about the microarray noise (log-values)

• bioinformatics: 3D structure prediction of proteins – disulfide bonds
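To make the kernel idea above more concrete, a minimal sketch of a simple string kernel, the k-mer spectrum kernel: represent each sequence by the counts of all length-k substrings and take the dot product of the count vectors. This is a generic illustration, not the exact string, Smith-Waterman, local alignment, or motif kernels listed above; the sequences are made up.

    from collections import Counter

    def kmer_counts(seq, k=3):
        """Counts of all length-k substrings (the k-mer spectrum) of a sequence."""
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    def spectrum_kernel(seq_a, seq_b, k=3):
        """Dot product of the two k-mer count vectors."""
        ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
        return sum(ca[kmer] * cb[kmer] for kmer in ca if kmer in cb)

    # made-up amino acid sequences; shared 3-mers: AAG, AGL, GLL, LLK -> kernel value 4
    print(spectrum_kernel("MKVLAAGLLK", "AAGLLKMPS", k=3))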


Model Selection and Training


• Goal: select from the model class the model with the highest generalization performance, that is, with the best performance on future data

• model selection is training is learning

• model which best explains or approximates the training set

• remember salmon vs. sea bass: the model which perfectly explained the training data had low generalization performance

• "overfitting": the model is fitted (adapted) to special characteristics of the training set
  - noisy measurements
  - outliers
  - labeling errors


• “underfitting”: training data cannot be fitted well enough

• trade-off between underfitting and overfitting


• overfitting is bounded by restricting the model class (k in k-nearest-neighbor, number of units in neural networks, maximal weights, etc.)

• model class often chosen a priori

• Sometimes model class can be adjusted during training

• structural risk minimization

• model selection parameters may influence the model complexity
  - the nonlinearity of neural networks is increased during training
  - the model selection procedure cannot find complex models

• hyperparameters: parameters controlling the model complexity


Model Evaluation, Hyperparameter Selection, and Final Model


• how to select the hyperparameters? (e.g. the number of features)

• kernel density estimation (KDE): best hyperparameter (the kernelwidth) can be computed under certain assumptions

• n-fold cross-validation for hyperparameter selection:
  - the training set is divided into n parts
  - n runs, where in the i-th run part i is used for testing
  - average the error over all runs for all hyperparameter combinations
  - choose the parameter combination with the smallest average error (a code sketch follows below)

• the cross-validation error approximates the generalization error, but
  - the cross-validation training sets are overlapping
  - points from the withheld fold are predicted with the same model, so that an outlier has multiple influence on the result

• leave-one-out cross-validation: only one data point is removed per fold

• assumption: the training set size is not important (only one fold is removed)
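A minimal numpy sketch of the hyperparameter-selection procedure above, choosing the number of neighbors k of a k-nearest-neighbor classifier by 5-fold cross-validation; the data are random placeholders and knn_predict is a small helper defined here only for illustration.

    import numpy as np

    def knn_predict(X_tr, y_tr, X_te, k):
        """k-nearest-neighbor prediction for every row of X_te."""
        preds = []
        for x in X_te:
            nearest = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
            vals, cnts = np.unique(y_tr[nearest], return_counts=True)
            preds.append(vals[np.argmax(cnts)])
        return np.array(preds)

    rng = np.random.default_rng(0)                 # placeholder data
    X = rng.normal(size=(60, 4))
    y = np.where(X[:, 0] + 0.3 * rng.normal(size=60) > 0, 1, -1)

    n = 5                                          # number of folds
    folds = np.array_split(rng.permutation(len(X)), n)

    best_k, best_err = None, np.inf
    for k in [1, 3, 5, 7, 9]:                      # hyperparameter candidates
        errs = []
        for i in range(n):                         # fold i is the test part
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(n) if j != i])
            pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], k)
            errs.append(np.mean(pred != y[test_idx]))
        avg = np.mean(errs)                        # average error over all folds
        if avg < best_err:
            best_k, best_err = k, avg

    print(best_k, best_err)    # hyperparameter with the smallest average CV error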


• How to estimate the performance of a model?

• n-fold cross-validation, but
  - another k-fold cross-validation on each training set to select the hyperparameters
  - feature selection and feature ranking must also be done on each training set, i.e. for each fold

• well-known error: feature selection on all data and then cross-validation
  - among equally relevant features, the ones which happen to be relevant also on the test fold are ranked higher
  (a sketch of fold-wise feature selection follows below)
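A minimal sketch of the point above: the (illustrative) select_features step uses only the training part of each fold, never the withheld fold. The correlation filter and the random data are placeholders.

    import numpy as np

    def select_features(X_tr, y_tr, n_keep=10):
        """Rank features by absolute correlation with the target,
        computed on the TRAINING part only (a simple filter criterion)."""
        scores = [abs(np.corrcoef(X_tr[:, j], y_tr)[0, 1]) for j in range(X_tr.shape[1])]
        return np.argsort(scores)[::-1][:n_keep]

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 200))                 # placeholder: 50 samples, 200 features
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

    n = 5
    folds = np.array_split(rng.permutation(len(X)), n)
    for i in range(n):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n) if j != i])
        feats = select_features(X[train_idx], y[train_idx])   # selection inside the fold
        # ... train and evaluate a model on X[train_idx][:, feats] / X[test_idx][:, feats]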


Comparing models

• type I and type II error:
  - Type I: wrongly detect a difference
  - Type II: miss a difference

• methods for testing the performance:
  - paired t-test: repeatedly dividing the data into test and training set; too many type I errors
  - k-fold cross-validated paired t-test: fewer type I errors than the paired t-test
  - McNemar's test: good type I and type II errors (sketch below)
  - 5x2CV (5 times two-fold cross-validation): comparable to McNemar; two-fold: many test points, no overlapping training sets

• other criteria:
  - space and time complexity
  - both for training and for testing (practical use)
  - training time often not relevant (waiting a week to make money is acceptable)
  - if testing is fast, averaging over many runs is possible
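As an illustration of one of the tests above, a minimal sketch of McNemar's test (with continuity correction) for comparing two classifiers on the same test examples; the 0/1 error indicators are placeholders. Only examples on which exactly one classifier errs enter the statistic.

    import numpy as np

    # 1 = classifier made an error on that test example, 0 = correct (placeholder data)
    err_a = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])
    err_b = np.array([1, 0, 1, 0, 1, 1, 0, 1, 1, 1])

    n01 = np.sum((err_a == 0) & (err_b == 1))   # only classifier B errs
    n10 = np.sum((err_a == 1) & (err_b == 0))   # only classifier A errs

    # McNemar statistic with continuity correction; approx. chi-squared with 1 df
    chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    print(n01, n10, chi2)                       # -> 5 1 1.5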


Chapter 3

Theoretical Background of Machine Learning


• quality criteria: the goal for model selection / learning

• approximations

• unsupervised learning: Maximum Likelihood

• concepts: bias and variance, efficient estimator, Fisher information

• unsupervised approach to supervised learning: error model


• Does learning from examples help in the future?

• "empirical risk minimization" (ERM)

• complexity is restricted and dynamics fixed

• “Learning helps”: more training examples improve the model

• converges to the best model for all future data

• convergence is fast

• complexity of a model class: VC-dimension (Vapnik-Chervonenkis)

• “structural risk minimization” (SRM): complexity and model quality

• bounds on the generalization error


Model Quality Criteria


• Learning equivalent to model selection

• quality criteria: future data is optimally processed

• other concepts: visualization, modeling, data compression

• Kohonen networks: no scalar quality criterion (potential function)

• advantages of quality criteria:
  - comparison of different models
  - quality during learning is known

• supervised quality criteria: rate of misclassifications or squared error

• unsupervised criteria:
  - likelihood
  - ratio of between- and within-cluster distances
  - independence of the components
  - information content
  - expected reconstruction error


Generalization Error


• performance of a model on future data: generalization error

• error on one example: loss or error

• expected loss: risk or generalization error


Definition of the Generalization Error/Risk


Training set: $X = \{x^1, \ldots, x^l\}$, label or target value: $y^i \in \mathbb{R}$

Simple case: $z = (x, y)$ and $z \in Z = \mathbb{R}^{d+1}$

Training set: $\{z^1, \ldots, z^l\}$

Matrix notation for training inputs: $X = \left(x^1, \ldots, x^l\right)^T$

Vector notation for labels: $y = \left(y^1, \ldots, y^l\right)^T$

Matrix notation for training set: $Z = \left(z^1, \ldots, z^l\right)$


The loss function $L(y, g(x;w))$:

quadratic loss: $L(y, g(x;w)) = \left(y - g(x;w)\right)^2$

zero-one loss: $L(y, g(x;w)) = \begin{cases} 0 & \text{for } y = g(x;w) \\ 1 & \text{for } y \neq g(x;w) \end{cases}$

Generalization error: $R(g(\cdot;w)) = \mathrm{E}_z\left(L(y, g(x;w))\right) = \int_Z L(y, g(x;w)) \, p(z) \, dz$



$y$ is a function of $x$ (target function $y = f(x)$) plus noise:

$y = f(x) + \epsilon$

$p(y \mid x) = p_n(y - f(x))$

$p(z) = p(x) \, p(y \mid x) = p(x) \, p_n(y - f(x))$

Now the risk can be computed as

$R(g(\cdot;w)) = \int_Z L(y, g(x;w)) \, p(x) \, p_n(y - f(x)) \, dz = \int_X p(x) \int_{\mathbb{R}} L(y, g(x;w)) \, p_n(y - f(x)) \, dy \, dx$


Risk for a given $x$:

$R(g(x;w)) = \mathrm{E}_{y \mid x}\left(L(y, g(x;w))\right) = \int_{\mathbb{R}} L(y, g(x;w)) \, p_n(y - f(x)) \, dy$


The noise-free case is $y = f(x)$:

$R(g(x;w)) = L(f(x), g(x;w)) = L(y, g(x;w))$

and the risk simplifies to

$R(g(\cdot;w)) = \int_X p(x) \, L(f(x), g(x;w)) \, dx$



Empirical Estimation of the Generalization Error

• p(z) is unknown

• especially p(y|x)

• the risk cannot be computed

• practical applications: approximation of the risk

• model performance estimation for the user


Test Set

Test set approximation: the expectation

$R(g(\cdot;w)) = \mathrm{E}_z\left(L(y, g(x;w))\right)$

can be approximated using a test set $\{z^{l+1}, \ldots, z^{l+m}\}$:

$R(g(\cdot;w)) \approx \frac{1}{m} \sum_{i=l+1}^{l+m} L\left(y^i, g(x^i;w)\right)$
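A minimal numpy sketch of this test-set approximation, using the zero-one and quadratic losses defined earlier; the model g and the test data are placeholders.

    import numpy as np

    def g(x, w):
        """Placeholder model: sign of a linear function with parameters w."""
        return np.sign(x @ w)

    rng = np.random.default_rng(0)
    w = np.array([1.0, -0.5])
    X_test = rng.normal(size=(200, 2))               # test inputs x^{l+1}, ..., x^{l+m}
    y_test = np.sign(X_test @ np.array([1.0, 0.0]))  # placeholder test labels

    pred = g(X_test, w)
    risk_01 = np.mean(pred != y_test)                # zero-one loss estimate of R
    risk_sq = np.mean((y_test - pred) ** 2)          # quadratic loss estimate of R
    print(risk_01, risk_sq)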


Cross-Validation

Cross-validation folds:


• not enough data for test set (needed for training)

• cross-validation


n-fold cross-validation (here: 5-fold):


$$R_{n\text{-}\mathrm{cv}}(Z^l) = \frac{1}{n} \sum_{j=1}^{n} \underbrace{\frac{n}{l} \sum_{z \in Z_j^{l/n}} L\left(y,\, g\left(x;\, w_j\left(Z^l \setminus Z_j^{l/n}\right)\right)\right)}_{R_{n\text{-}\mathrm{cv},j}(Z^l)}$$

Cross-validation is an almost unbiased estimator of the generalization error:

$$\mathrm{E}_{Z^{l(1-1/n)}}\left(R\left(g\left(\cdot;\, w\left(Z^{l(1-1/n)}\right)\right)\right)\right) = \mathrm{E}_{Z^l}\left(R_{n\text{-}\mathrm{cv}}(Z^l)\right)$$

That is, the generalization error for a training set of size $l - l/n$ (the training data without one fold) can be estimated by n-fold cross-validation on training data of size $l$.
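A sketch of this n-fold cross-validation estimate, assuming user-supplied `train` and `risk` functions (both hypothetical); it is an illustration, not the course's reference implementation.

```python
import numpy as np

def n_fold_cv(X, y, train, risk, n=5, seed=0):
    """R_{n-cv}: average the per-fold risks; each model w_j is trained on
    all folds except fold j and evaluated on the held-out fold j."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n)
    fold_risks = []
    for j in range(n):
        train_idx = np.concatenate([folds[i] for i in range(n) if i != j])
        model = train(X[train_idx], y[train_idx])      # w_j trained without fold j
        fold_risks.append(risk(model, X[folds[j]], y[folds[j]]))
    return float(np.mean(fold_risks))                   # R_{n-cv}(Z^l)
```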


Cross-Validation

• advantage: each example is used as a test example only once (better than repeatedly splitting the data into training and test sets)

• disadvantages:
-- the training sets overlap
-- test examples within one fold are evaluated on the same model and are therefore dependent
-- because of these dependencies the cross-validation estimate has high variance

(a single outlier can influence all fold estimates)

• special case: leave-one-out cross-validation (LOO-CV)
-- l-fold cross-validation, where each fold consists of a single example
-- test examples do not share the same model
-- the training sets are maximally overlapping


Minimal Risk for a Gaussian Classification Task

Class y = 1 data points are drawn according to

$$p(x \mid y = 1) \propto \mathcal{N}(\mu_1, \Sigma_1)$$

and class y = −1 according to

$$p(x \mid y = -1) \propto \mathcal{N}(\mu_{-1}, \Sigma_{-1})\,,$$

where the Gaussian $\mathcal{N}(\mu, \Sigma)$ has the density

$$p(x) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

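A small illustration (not from the slides) of evaluating this Gaussian density with NumPy; the parameter values are made up.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, following the formula above."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(gaussian_density(np.array([0.5, -1.0]), mu, Sigma))
```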


Minimal Risk for a Gaussian Classification Task


Linear transformations of Gaussians lead to Gaussians


Minimal Risk for a Gaussian Classification Task

• probability of observing a point at x:

y is “integrated out” -- here “summed out”

p(x) = p(x, y = 1) + p(x, y = −1)


• probability of observing a point from class y = 1 at x: $p(x, y = 1)$

• probability of observing a point from class y = −1 at x: $p(x, y = -1)$

• conditional probability: $p(x, y = 1) = p(x \mid y = 1)\, p(y = 1)$


Minimal Risk for a Gaussian Classification Task

• two-dimensional classification task

• data for each class from a Gaussian (black: class 1, red: class −1)

• optimal discriminant functions are two hyperbolas


Minimal Risk for a Gaussian Classification Task


• Bayes rule for probability of x belonging to class y=1:

$$p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x)}$$

region of predicted class y = 1: $X_1 = \{x \mid g(x) > 0\}$

region of predicted class y = −1: $X_{-1} = \{x \mid g(x) < 0\}$

loss function:

$$L(y, g(x;w)) = \begin{cases} 0 & \text{for } y\, g(x;w) > 0 \\ 1 & \text{for } y\, g(x;w) < 0 \end{cases}$$


Minimal Risk for a Gaussian Classification Task

Risk:

$$R(g(\cdot;w)) = \int_Z L(y, g(x;w))\, p(z)\, dz$$

Loss function contributions: on $X_1$: $p(x, y = -1)$, on $X_{-1}$: $p(x, y = 1)$

$$R(g(\cdot;w)) = \int_{X_1} p(x, y = -1)\, dx + \int_{X_{-1}} p(x, y = 1)\, dx = \int_{X_1} p(y = -1 \mid x)\, p(x)\, dx + \int_{X_{-1}} p(y = 1 \mid x)\, p(x)\, dx = \int_X \begin{cases} p(y = -1 \mid x) & \text{for } g(x) > 0 \\ p(y = 1 \mid x) & \text{for } g(x) < 0 \end{cases}\; p(x)\, dx$$


Minimal Risk for a Gaussian Classification Task

The minimal risk is obtained for

$$g(x;w) \begin{cases} > 0 & \text{for } p(y = 1 \mid x) > p(y = -1 \mid x) \\ < 0 & \text{for } p(y = -1 \mid x) > p(y = 1 \mid x) \end{cases}$$


Optimal discriminant (see later) function:

at each position x the smaller of the two values is taken:

$$R_{\min} = \int_X \min\{p(x, y = -1),\, p(x, y = 1)\}\, dx = \int_X \min\{p(y = -1 \mid x),\, p(y = 1 \mid x)\}\, p(x)\, dx$$
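A rough numerical check of $R_{\min}$ (not from the slides) for a one-dimensional example with two Gaussian classes and equal priors; the values and grid are illustrative.

```python
import numpy as np

def gauss1d(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# joint densities p(x, y) = p(x | y) p(y) for two 1D Gaussian classes
xs = np.linspace(-10.0, 10.0, 20001)
joint_pos = gauss1d(xs, mu=-1.0, var=1.0) * 0.5   # p(x, y = 1)
joint_neg = gauss1d(xs, mu=+1.0, var=1.0) * 0.5   # p(x, y = -1)

# R_min = integral of min{p(x, y = -1), p(x, y = 1)} dx  (simple Riemann sum)
R_min = np.minimum(joint_pos, joint_neg).sum() * (xs[1] - xs[0])
print(R_min)   # about 0.16 for these parameters
```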


Minimal Risk for a Gaussian Classification Task

• discriminant function g: if g(x) > 0 then x is assigned to y = 1; if g(x) < 0 then x is assigned to y = −1

• classification function: $y(x) = \mathrm{sign}(g(x))$

• optimal discriminant functions (minimal risk):

$$g(x) = p(y = 1 \mid x) - p(y = -1 \mid x)$$

or

$$g(x) = \ln p(y = 1 \mid x) - \ln p(y = -1 \mid x) = \ln \frac{p(x \mid y = 1)}{p(x \mid y = -1)} + \ln \frac{p(y = 1)}{p(y = -1)}$$


Minimal Risk for a Gaussian Classification Task


For Gaussians:

$$\begin{aligned}
g(x) &= -\tfrac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - \tfrac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln|\Sigma_1| + \ln p(y = 1) \\
&\quad + \tfrac{1}{2}(x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) + \tfrac{d}{2}\ln 2\pi + \tfrac{1}{2}\ln|\Sigma_2| - \ln p(y = -1) \\
&= -\tfrac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - \tfrac{1}{2}\ln|\Sigma_1| + \ln p(y = 1) \\
&\quad + \tfrac{1}{2}(x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) + \tfrac{1}{2}\ln|\Sigma_2| - \ln p(y = -1) \\
&= -\tfrac{1}{2}\, x^T \underbrace{\left(\Sigma_1^{-1} - \Sigma_2^{-1}\right)}_{A}\, x + x^T \underbrace{\left(\Sigma_1^{-1}\mu_1 - \Sigma_2^{-1}\mu_2\right)}_{w} \\
&\quad - \tfrac{1}{2}\mu_1^T \Sigma_1^{-1} \mu_1 + \tfrac{1}{2}\mu_2^T \Sigma_2^{-1} \mu_2 - \tfrac{1}{2}\ln|\Sigma_1| + \tfrac{1}{2}\ln|\Sigma_2| + \ln p(y = 1) - \ln p(y = -1) \\
&= -\tfrac{1}{2}\, x^T A\, x + w^T x + b
\end{aligned}$$

(here $\mu_2, \Sigma_2$ denote the parameters of class y = −1)
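A small sketch (illustrative parameter values, not from the slides) that builds $A$, $w$ and $b$ as defined above and evaluates the quadratic discriminant $g(x)$.

```python
import numpy as np

mu1, Sigma1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, Sigma2 = np.array([2.0, 1.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
p1, p2 = 0.5, 0.5                                  # priors p(y=1), p(y=-1)

S1inv, S2inv = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
A = S1inv - S2inv
w = S1inv @ mu1 - S2inv @ mu2
b = (-0.5 * mu1 @ S1inv @ mu1 + 0.5 * mu2 @ S2inv @ mu2
     - 0.5 * np.log(np.linalg.det(Sigma1)) + 0.5 * np.log(np.linalg.det(Sigma2))
     + np.log(p1) - np.log(p2))

def g(x):
    """Quadratic discriminant g(x) = -1/2 x^T A x + w^T x + b."""
    return -0.5 * x @ A @ x + w @ x + b

x = np.array([1.0, 0.5])
print(g(x), "-> predicted class", 1 if g(x) > 0 else -1)
```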


Minimal Risk for a Gaussian Classification Task


(figure: 1D, 2D, and 3D examples)


Maximum Likelihood


• one of the major objectives in learning generative models

• it has desirable theoretical properties

• theoretical concepts such as the efficient estimator or the biased estimator are introduced

• even supervised methods can be viewed as a special case of maximum likelihood


Loss for Unsupervised Learning


First we consider different loss functions which are used for unsupervised learning

• generative approaches: maximum likelihood

• projection methods: low information loss plus a desired property

• parameter estimation: difference between the estimated parameter vector and the optimal parameter vector


Projection Methods


• data are projected into another space with desired properties


• “Principal Component Analysis” (PCA): projection to a low dimensional space under maximal information conservation

• “Independent Component Analysis” (ICA): projection into a space with statistically independent components (factorial code)

often characteristics of a factorial distribution are optimized:
-- maximal entropy (given the variance)
-- cumulants

or prototype distributions should be matched:
-- product of special super-Gaussians

• “Projection Pursuit”: components are maximally non-Gaussian
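For the PCA bullet above, a minimal sketch (illustrative, not the course implementation) of projecting data onto the directions of maximal variance via the eigenvectors of the sample covariance.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the k directions of maximal variance."""
    Xc = X - X.mean(axis=0)                 # center the data
    C = Xc.T @ Xc / (len(X) - 1)            # sample covariance matrix
    eigval, eigvec = np.linalg.eigh(C)      # eigenvalues in ascending order
    W = eigvec[:, ::-1][:, :k]              # top-k principal directions
    return Xc @ W                           # low-dimensional projection

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
print(pca_project(X, k=2).shape)            # (200, 2)
```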


Generative Models


“generative model”: model simulates the world and produces the same data as the world


• the data generation process is probabilistic: there is an underlying distribution

• the generative model attempts to approximate this distribution

• loss function: the distance between the model's output distribution and the distribution of the data generation process

• Examples: “Factor Analysis”, “Latent Variable Models”, “Boltzmann Machines”, “Hidden Markov Models”


Parameter Estimation


• parameterized model known

• task: estimate actual parameters

• loss: difference between true and estimated parameter

• evaluate estimator: expected loss


Mean Squared Error, Bias, and Variance


Theoretical concepts of parameter estimation

• training data: $\{z\} = \{z^1, \ldots, z^l\}$ where $z^i = x^i$, i.e. $\{x\} = \{x^1, \ldots, x^l\}$, or simply $X = (x^1, \ldots, x^l)^T$ (the matrix of training data)

• true parameter vector: $w$

• estimate of $w$: $\hat{w}$


Mean Squared Error, Bias, and Variance


• unbiased estimator: $\mathrm{E}_X(\hat{w}) = w$
on average (over training sets) the true parameter is obtained

• bias: $b(\hat{w}) = \mathrm{E}_X(\hat{w}) - w$

• variance: $\mathrm{var}(\hat{w}) = \mathrm{E}_X\left((\hat{w} - \mathrm{E}_X(\hat{w}))^T (\hat{w} - \mathrm{E}_X(\hat{w}))\right)$

• mean squared error (MSE, different from the supervised loss): $\mathrm{mse}(\hat{w}) = \mathrm{E}_X\left((\hat{w} - w)^T (\hat{w} - w)\right)$
the expected squared error between the estimated and the true parameter


Mean Squared Error, Bias, and Variance


The cross term is zero because $(\mathrm{E}_X(\hat{w}) - w)$ only depends on $w$ (not on the training data $X$) and can be pulled out of the expectation:

$$\mathrm{E}_X\left((\hat{w} - \mathrm{E}_X(\hat{w}))^T (\mathrm{E}_X(\hat{w}) - w)\right) = \left(\mathrm{E}_X(\hat{w}) - \mathrm{E}_X(\hat{w})\right)^T \left(\mathrm{E}_X(\hat{w}) - w\right) = 0$$

Therefore the MSE decomposes into variance plus squared bias:

$$\begin{aligned}
\mathrm{mse}(\hat{w}) &= \mathrm{E}_X\left((\hat{w} - w)^T (\hat{w} - w)\right) \\
&= \mathrm{E}_X\left(\left((\hat{w} - \mathrm{E}_X(\hat{w})) + (\mathrm{E}_X(\hat{w}) - w)\right)^T \left((\hat{w} - \mathrm{E}_X(\hat{w})) + (\mathrm{E}_X(\hat{w}) - w)\right)\right) \\
&= \mathrm{E}_X\left((\hat{w} - \mathrm{E}_X(\hat{w}))^T (\hat{w} - \mathrm{E}_X(\hat{w})) + 2\,(\hat{w} - \mathrm{E}_X(\hat{w}))^T (\mathrm{E}_X(\hat{w}) - w) + (\mathrm{E}_X(\hat{w}) - w)^T (\mathrm{E}_X(\hat{w}) - w)\right) \\
&= \mathrm{E}_X\left((\hat{w} - \mathrm{E}_X(\hat{w}))^T (\hat{w} - \mathrm{E}_X(\hat{w}))\right) + (\mathrm{E}_X(\hat{w}) - w)^T (\mathrm{E}_X(\hat{w}) - w) \\
&= \mathrm{var}(\hat{w}) + b^2(\hat{w})
\end{aligned}$$
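A quick Monte Carlo check of $\mathrm{mse} = \mathrm{var} + b^2$ (illustrative, not from the slides), using the biased sample variance of a Gaussian sample as the estimator $\hat{w}$.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, l, runs = 4.0, 20, 200_000

# estimator w_hat: biased sample variance (division by l, not l - 1)
samples = rng.normal(0.0, np.sqrt(true_var), size=(runs, l))
w_hat = samples.var(axis=1)                 # one estimate per training set

mse = np.mean((w_hat - true_var) ** 2)
var = np.var(w_hat)
bias = np.mean(w_hat) - true_var
print(mse, var + bias ** 2)                 # the two values coincide
```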


Mean Squared Error, Bias, and Variance


Averaging reduces the variance. Each of the $N$ subsets has $l/N$ examples, which gives $l$ examples in total.

The average is

$$\hat{w}_{aN} = \frac{1}{N} \sum_{i=1}^{N} \hat{w}_i\,, \qquad \hat{w}_i = \hat{w}_i(X_i)\,, \quad X_i = \left\{x^{(i-1)l/N + 1}, \ldots, x^{i\,l/N}\right\}$$

Unbiased:

$$\mathrm{E}_X(\hat{w}_{aN}) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{E}_{X_i}(\hat{w}_i) = \frac{1}{N} \sum_{i=1}^{N} w = w$$

Variance:

$$\mathrm{covar}_X(\hat{w}_{aN}) = \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{covar}_{X_i}(\hat{w}_i) = \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{covar}_{X, l/N}(\hat{w}) = \frac{1}{N}\, \mathrm{covar}_{X, l/N}(\hat{w})$$
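A short simulation (illustrative) of this $1/N$ variance reduction: the mean is estimated on each of $N$ disjoint subsets of $l/N$ examples and the estimates are averaged.

```python
import numpy as np

rng = np.random.default_rng(1)
l, N, runs = 100, 5, 50_000

data = rng.normal(loc=0.0, scale=1.0, size=(runs, l))
subsets = data.reshape(runs, N, l // N)     # N disjoint subsets X_i per run
w_i = subsets.mean(axis=2)                  # estimator on each subset X_i
w_aN = w_i.mean(axis=1)                     # averaged estimator

print(np.var(w_i[:, 0]))   # variance of a single-subset estimator (~ 1/20)
print(np.var(w_aN))        # roughly 1/N of that (~ 1/100)
```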


Mean Squared Error, Bias, and Variance


• averaging: the training sets $X_i$ are independent, therefore the covariance between them vanishes

• Minimal Variance Unbiased (MVU) estimator: among all unbiased estimators, the one with minimal variance

• the MVU estimator does not always exist

• there are methods to check whether a given estimator is an MVU estimator


Fisher Information Matrix, Cramer-Rao Lower Bound, and Efficiency


• We will find a lower bound for the variance of an unbiased estimator: the Cramer-Rao Lower Bound (which is also a lower bound for the MSE)

• We need the Fisher information matrix $I_F$:

$$I_F(w): \quad [I_F(w)]_{ij} = \mathrm{E}_{p(x;w)}\left(\frac{\partial \ln p(x;w)}{\partial w_i}\, \frac{\partial \ln p(x;w)}{\partial w_j}\right)$$

with

$$\mathrm{E}_{p(x;w)}\left(\frac{\partial \ln p(x;w)}{\partial w_i}\, \frac{\partial \ln p(x;w)}{\partial w_j}\right) = \int \frac{\partial \ln p(x;w)}{\partial w_i}\, \frac{\partial \ln p(x;w)}{\partial w_j}\, p(x;w)\, dx$$
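An illustrative numerical check of this definition (not from the slides) for a single observation $x \sim \mathcal{N}(w, \sigma^2)$ with known $\sigma^2$: the score is $\partial \ln p / \partial w = (x - w)/\sigma^2$, so the Fisher information should be $1/\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true, sigma2 = 1.5, 2.0

x = rng.normal(w_true, np.sqrt(sigma2), size=1_000_000)
score = (x - w_true) / sigma2            # d ln p(x; w) / dw evaluated at w_true
I_F_mc = np.mean(score ** 2)             # E[(d ln p / dw)^2]

print(I_F_mc, 1.0 / sigma2)              # both are close to 0.5
```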


Fisher Information Matrix, Cramer-Rao Lower Bound, and Efficiency


If $p(x;w)$ satisfies

$$\forall w: \quad \mathrm{E}_{p(x;w)}\left(\frac{\partial \ln p(x;w)}{\partial w}\right) = 0\,,$$

then the Fisher information matrix is

$$I_F(w) = -\,\mathrm{E}_{p(x;w)}\left(\frac{\partial^2 \ln p(x;w)}{\partial w\, \partial w}\right)$$

Fisher information: the information that an observation $x$ contains about the parameter $w$ upon which the parameterized density function $p(x;w)$ of $x$ depends.


Fisher Information Matrix, Cramer-Rao Lower Bound, and Efficiency


Theorem 1 (Cramer-Rao Lower Bound (CRLB)). Assume that

$$\forall w: \quad \mathrm{E}_{p(x;w)}\left(\frac{\partial \ln p(x;w)}{\partial w}\right) = 0$$

and that the estimator $\hat{w}$ is unbiased. Then $\mathrm{covar}(\hat{w}) - I_F^{-1}(w)$ is positive definite:

$$\mathrm{covar}(\hat{w}) - I_F^{-1}(w) \geq 0\,.$$

An unbiased estimator attains the bound, i.e. $\mathrm{covar}(\hat{w}) = I_F^{-1}(w)$, if and only if

$$\frac{\partial \ln p(x;w)}{\partial w} = A(w)\,(g(x) - w)$$

for some function $g$ and square matrix $A(w)$. In this case the MVU estimator is $\hat{w} = g(x)$ with $\mathrm{covar}(\hat{w}) = A^{-1}(w)$.


Fisher Information Matrix, Cramer-Rao Lower Bound, and Efficiency


• efficient estimator: reaches the CRLB (efficiently uses the data)

• the MVU estimator can be efficient but need not be

$$\mathrm{var}(\hat{w}_i) = [\mathrm{covar}(\hat{w})]_{ii} \geq \left[I_F^{-1}(w)\right]_{ii}$$

(figure; dashed: CRLB)



Maximum Likelihood Estimator


• often the MVU estimator is unknown or does not exist

• Maximum Likelihood Estimator (MLE)

• MLE can be applied to a broad range of problems

• MLE approximates the MVU estimator for large data sets

• MLE is even asymptotically efficient and unbiased

• MLE does everything right, and does so efficiently (given enough data)


Maximum Likelihood Estimator


The likelihood $\mathcal{L}$ of the data set $\{x\} = \{x^1, \ldots, x^l\}$:

$$\mathcal{L}(\{x\}; w) = p(\{x\}; w)\,,$$

the probability of the model $p(x;w)$ to produce the data.

For iid (independent identically distributed) data:

$$\mathcal{L}(\{x\}; w) = p(\{x\}; w) = \prod_{i=1}^{l} p(x^i; w)$$

Negative log-likelihood:

$$-\ln \mathcal{L}(\{x\}; w) = -\sum_{i=1}^{l} \ln p(x^i; w)$$
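A small sketch (illustrative) of this negative log-likelihood for iid data, with a 1D Gaussian as the model density $p(x;w)$.

```python
import numpy as np

def neg_log_likelihood(x, mu, sigma2):
    """- ln L({x}; w) = - sum_i ln p(x^i; w) for a 1D Gaussian model."""
    log_p = -0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
    return float(-np.sum(log_p))

x = np.random.default_rng(0).normal(3.0, 1.0, size=500)
# the negative log-likelihood is smaller near the true parameters
print(neg_log_likelihood(x, mu=3.0, sigma2=1.0))
print(neg_log_likelihood(x, mu=0.0, sigma2=1.0))
```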


Maximum Likelihood Estimator


• the likelihood is based on finitely many density values, which have zero measure: is that a problem?

• remedy: assume $p(x^i;w)\, dx$ instead of $p(x^i;w)$, i.e. the volume element (a region around $x^i$)

• MLE is popular because of:
-- its simple use
-- its properties


Properties of Maximum Likelihood Estimator


MLE:

• invariant under parameter change

• asymptotically unbiased and efficient, i.e. asymptotically optimal

• consistent for zero CRLB


MLE is Invariant under Parameter Change


Theorem 1 (Parameter Change Invariance). Let $g$ be a function changing the parameter $w$ into the parameter $u$: $u = g(w)$. Then

$$\hat{u} = g(\hat{w})\,,$$

where the estimators are MLEs. If $g$ maps different $w$ to the same $u$, then $\hat{u} = g(\hat{w})$ maximizes the induced likelihood function

$$\max_{w:\, u = g(w)} p(\{x\}; w)\,.$$


MLE is Asymptotically Unbiased and Efficient


The maximum likelihood estimator is asymptotically unbiased:

$$\mathrm{E}_{p(x;w)}(\hat{w}) \;\xrightarrow{l \to \infty}\; w$$

The maximum likelihood estimator is asymptotically efficient:

$$\mathrm{covar}(\hat{w}) \;\xrightarrow{l \to \infty}\; \mathrm{CRLB}$$


MLE is Asymptotically Unbiased and Efficient


Theorem 1 (MLE Asymptotic Properties). If $p(x;w)$ satisfies

$$\forall w: \quad \mathrm{E}_{p(x;w)}\left(\frac{\partial \ln p(x;w)}{\partial w}\right) = 0\,,$$

then the MLE $\hat{w}$, which maximizes $p(\{x\};w)$, is asymptotically distributed according to

$$\hat{w} \;\overset{l \to \infty}{\propto}\; \mathcal{N}\left(w,\, I_F^{-1}(w)\right),$$

where $I_F(w)$ is the Fisher information matrix evaluated at the unknown parameter $w$.


MLE is Asymptotically Unbiased and Efficient


• in practical applications with finitely many examples the MLE performance is unknown

• example: the general linear model

$$x = A\,w + \epsilon\,, \qquad \epsilon \propto \mathcal{N}(0, C)$$

where the MLE is

$$\hat{w} = \left(A^T C^{-1} A\right)^{-1} A^T C^{-1} x\,, \qquad \hat{w} \propto \mathcal{N}\left(w,\, \left(A^T C^{-1} A\right)^{-1}\right),$$

which is efficient and MVU. Note that the noise covariance $C$ must be known.
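A sketch of this estimator for the general linear model (illustrative dimensions and values; the noise covariance $C$ is assumed known).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
A = rng.normal(size=(n, k))
w_true = np.array([1.0, -2.0, 0.5])
C = np.diag(rng.uniform(0.5, 2.0, size=n))          # known noise covariance

eps = rng.multivariate_normal(np.zeros(n), C)
x = A @ w_true + eps

# w_hat = (A^T C^-1 A)^-1 A^T C^-1 x
Cinv = np.linalg.inv(C)
w_hat = np.linalg.solve(A.T @ Cinv @ A, A.T @ Cinv @ x)
print(w_hat)                                         # close to w_true
```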


MLE is Consistent for Zero CRLB


• consistent: $\hat{w} \xrightarrow{l \to \infty} w$

for large training sets the estimator approaches the true value (in contrast to unbiasedness, the variance also decreases)

• later a more formal definition of consistency will be given as

$$\lim_{l \to \infty} p\left(|\hat{w} - w| > \epsilon\right) = 0$$

Thus, the MLE is consistent if the CRLB is zero.


Expectation Maximization


• likelihood can be optimized by gradient descent methods

• the likelihood cannot be computed analytically in the presence of:
-- hidden states
-- many-to-one output mappings
-- non-linearities


Expectation Maximization


• hidden variables, latent variables, unobserved variables

• the likelihood is determined by all hidden $u$ that are mapped to the observed $x$


Lower bound on the log-likelihood:

$$\begin{aligned}
\ln \mathcal{L}(\{x\};w) = \ln p(\{x\};w) &= \ln \int_U p(\{x\}, u; w)\, du \\
&= \ln \int_U Q(u \mid \{x\})\, \frac{p(\{x\}, u; w)}{Q(u \mid \{x\})}\, du \\
&\geq \int_U Q(u \mid \{x\}) \ln \frac{p(\{x\}, u; w)}{Q(u \mid \{x\})}\, du \\
&= \int_U Q(u \mid \{x\}) \ln p(\{x\}, u; w)\, du - \int_U Q(u \mid \{x\}) \ln Q(u \mid \{x\})\, du \\
&= \mathcal{F}(Q, w)
\end{aligned}$$

Expectation Maximization


• Expectation Maximization (EM) algorithm:
-- the joint probability $p(x, u; w)$ is easier to compute than the likelihood
-- estimate $p(u \mid x; w)$ by $Q(u \mid x)$
-- the inequality above is Jensen's inequality


Expectation Maximization


• the EM algorithm is an iteration between the “E”-step and the “M”-step:

E-step:

$$Q_{k+1} = \arg\max_Q\; \mathcal{F}(Q, w_k)$$

M-step:

$$w_{k+1} = \arg\max_w\; \mathcal{F}(Q_{k+1}, w)$$
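As an illustration of this E/M iteration (a sketch under simplifying assumptions, not a derivation from the slides): EM for a two-component 1D Gaussian mixture, where the hidden variable $u$ is the component assignment.

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a two-component 1D Gaussian mixture (hidden u = component)."""
    mu = np.array([x.min(), x.max()])              # crude initialization
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: Q(u | x) = responsibilities, i.e. p(u | x; w_k)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize F(Q_{k+1}, w) over the parameters w = (mu, var, pi)
        Nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return mu, var, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x))
```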


Expectation Maximization


• After the E-step:

$$Q_{k+1}(u \mid \{x\}) = p(u \mid \{x\}; w_k)\,, \qquad \mathcal{F}(Q_{k+1}, w_k) = \ln \mathcal{L}(\{x\}; w_k)$$

Proof: using $p(u, \{x\}; w_k) = p(u \mid \{x\}; w_k)\, p(\{x\}; w_k)$,

$$\begin{aligned}
\mathcal{F}(Q, w) &= \int_U Q(u \mid \{x\}) \ln \frac{p(\{x\}, u; w)}{Q(u \mid \{x\})}\, du \\
&= \int_U Q(u \mid \{x\}) \ln \frac{p(u \mid \{x\}; w)}{Q(u \mid \{x\})}\, du + \ln p(\{x\}; w) \\
&= \int_U Q(u \mid \{x\}) \ln \frac{p(u \mid \{x\}; w)}{Q(u \mid \{x\})}\, du + \ln \mathcal{L}(\{x\}; w)
\end{aligned}$$

The first term is the negative Kullback-Leibler divergence:

$$D_{\mathrm{KL}}(Q \,\|\, p) = \int_U Q(u \mid \{x\}) \ln \frac{Q(u \mid \{x\})}{p(u \mid \{x\}; w)}\, du \;\geq\; 0\,,$$

which is zero for $Q(u \mid \{x\}) = p(u \mid \{x\}; w)$.


Expectation Maximization


• EM increases the lower bound in both steps

• At the beginning of the M-step $F(Q^{k+1}, w^k) = \ln L(\{x\}; w^k)$, since the E-step does not change the parameters:
$$\ln L(\{x\}; w^k) = F(Q^{k+1}, w^k) \leq F(Q^{k+1}, w^{k+1}) \leq F(Q^{k+2}, w^{k+1}) = \ln L(\{x\}; w^{k+1})$$

• The EM algorithm is used for:
-- hidden Markov models
-- mixture of Gaussians
-- factor analysis
-- independent component analysis
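As a concrete illustration of the E- and M-step above, the following sketch runs EM for a one-dimensional mixture of two Gaussians, one of the listed applications. The data, the initialization, the number of components, and the helper name em_gmm_1d are assumptions made only for this example.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    # crude initialization (assumption: two components are enough)
    pi = np.array([0.5, 0.5])               # mixing weights
    mu = np.array([x.min(), x.max()])       # component means
    var = np.array([x.var(), x.var()])      # component variances
    for _ in range(n_iter):
        # E-step: responsibilities Q^{k+1}(u | x) = p(u | x; w^k)
        dens = np.stack([pi[k] * np.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                         / np.sqrt(2 * np.pi * var[k]) for k in range(2)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximize F(Q^{k+1}, w) over the parameters w = (pi, mu, var)
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# toy data drawn from two Gaussians (made up for the example)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(x))
```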

Noise Models

• noise models connect unsupervised and supervised learning

• they provide the quality measure

• assume noise on the targets

• then apply maximum likelihood


• Gaussian target noise
• linear model:
$$s = X w, \qquad y = s + \epsilon = X w + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \Sigma)$$
likelihood:
$$L((y, X); w) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(y - X w)^T \Sigma^{-1} (y - X w)\right)$$
log-likelihood:
$$\ln L((y, X); w) = -\frac{d}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}(y - X w)^T \Sigma^{-1}(y - X w)$$


• minimize the least squares criterion
$$(y - X w)^T \Sigma^{-1}(y - X w)$$
derivative with respect to $w$:
$$-2\, X^T \Sigma^{-1} y + 2\, X^T \Sigma^{-1} X w$$
Setting the derivative to zero (Wiener-Hopf equations) gives the linear least squares estimator:
$$w = \left(X^T \Sigma^{-1} X\right)^{-1} X^T \Sigma^{-1} y$$

Gaussian Noise

The noise covariance matrix gives the noise for each measurement. In most cases we have the same noise for each observation:
$$\Sigma^{-1} = \frac{1}{\sigma}\, I$$
We obtain
$$w = \left(X^T X\right)^{-1} X^T y$$
where $\left(X^T X\right)^{-1} X^T$ is the pseudo inverse (Moore-Penrose inverse).
Minimal value of the criterion:
$$\frac{1}{\sigma}\, y^T \left(I - X \left(X^T X\right)^{-1} X^T\right) y$$
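A minimal numerical sketch of the two estimators above, assuming made-up regression data and a known noise level: w_ols uses the Moore-Penrose pseudo inverse, w_gls the general weighted form with the inverse noise covariance.

```python
import numpy as np

# toy regression data (made up for the example)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=100)

# ordinary least squares via the Moore-Penrose pseudo inverse: w = (X^T X)^{-1} X^T y
w_ols = np.linalg.pinv(X) @ y

# generalized least squares with a known noise covariance Sigma:
# w = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y
Sigma_inv = np.eye(100) / 0.1 ** 2   # i.i.d. noise with the same variance per observation (assumption)
w_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

print(w_ols, w_gls)   # both should be close to w_true
```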

Laplace Noise and Minkowski Error

Laplace noise assumption (error term $\|y - g(x; w)\|_1$):
$$p(y - g(x; w)) = \frac{\beta}{2} \exp\left(-\beta\, |y - g(x; w)|\right)$$
More general Minkowski error $|y - g(x; w)|^r$:
$$p(y - g(x; w)) = \frac{r\, \beta^{1/r}}{2\, \Gamma(1/r)} \exp\left(-\beta\, |y - g(x; w)|^r\right),$$
with the gamma function
$$\Gamma(a) = \int_0^{\infty} u^{a-1}\, e^{-u}\, du, \qquad \Gamma(n) = (n - 1)!$$
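To see how the exponent r shapes the error, the following sketch evaluates the negative log of the Minkowski noise density above, which is β|y − g|^r plus a constant. The grid of errors, β = 1, and the chosen values of r are assumptions for illustration; r = 1 recovers the Laplace case and r = 2 the Gaussian case.

```python
import numpy as np
from math import gamma

def minkowski_neg_log_density(e, r, beta=1.0):
    """Negative log of p(e) = r beta^(1/r) / (2 Gamma(1/r)) exp(-beta |e|^r)."""
    log_norm = np.log(r * beta ** (1.0 / r) / (2.0 * gamma(1.0 / r)))
    return beta * np.abs(e) ** r - log_norm

errors = np.linspace(-3, 3, 7)
for r in (1.0, 2.0, 4.0):   # r = 1: Laplace noise, r = 2: Gaussian noise, larger r: flatter near zero
    print(r, np.round(minkowski_neg_log_density(errors, r), 3))
```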

Binary Models

• the above noise considerations do not hold for binary targets

• classification has not been treated yet

Cross-Entropy

classification problem with K classes: $y \in \{e_1, \ldots, e_K\}$ and
$$g_k(x; w) = p(y = e_k \mid x)$$
If $x$ is in the $k$-th class, then $y = (0, \ldots, 0, 1, 0, \ldots, 0)$ with the 1 at position $k$.
Likelihood:
$$L(\{z\}; w) = p(\{z\}; w) = \prod_{i=1}^{l} \prod_{k=1}^{K} p(y^i = e_k \mid x^i; w)^{[y^i]_k}\, p(x^i)$$
since
$$\prod_{k=1}^{K} p(y^i = e_k \mid x^i; w)^{[y^i]_k} = p(y^i = e_r \mid x^i; w) \quad \text{for } y^i = e_r$$


The log-likelihood:
$$\ln L(\{z\}; w) = \sum_{k=1}^{K} \sum_{i=1}^{l} \left[y^i\right]_k \ln p(y^i = e_k \mid x^i; w) + \sum_{i=1}^{l} \ln p(x^i)$$
loss function:
$$\sum_{k=1}^{K} \sum_{i=1}^{l} \left[y^i\right]_k \ln p(y^i = e_k \mid x^i; w)$$
the cross entropy (Kullback-Leibler) between $\left[y^i\right]_k \sim p(y^i = e_k)$ and the model, where
$$\left[y^i\right]_k = \begin{cases} 1 & \text{for } y^i = e_k \\ 0 & \text{for } y^i \neq e_k \end{cases}$$
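A small numerical sketch of the loss above for one-hot targets $[y^i]_k$: the function returns the negative of the likelihood term, averaged over the sample, which is the cross entropy that is minimized in practice. The example targets, the model probabilities, and the eps safeguard are made up.

```python
import numpy as np

def cross_entropy(y_onehot, p_model):
    """Mean of -sum_k [y^i]_k ln p(y^i = e_k | x^i; w), the negative likelihood term above."""
    eps = 1e-12                               # numerical safeguard (assumption)
    return -np.mean(np.sum(y_onehot * np.log(p_model + eps), axis=1))

# toy example with l = 3 samples and K = 3 classes (values made up)
y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]])
print(cross_entropy(y, p))
```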

Logistic Regression

a function $g$ mapping $x$ onto $\mathbb{R}$ can be transformed into a probability:
$$p(y = 1 \mid x; w) = \frac{1}{1 + e^{-g(x; w)}}$$


It follows:
$$g(x; w) = \ln\left(\frac{p(y = 1 \mid x)}{1 - p(y = 1 \mid x)}\right).$$
log-likelihood:
$$\ln L(\{z\}; w) = \sum_{i=1}^{l} \ln p(z^i; w) = \sum_{i=1}^{l} \ln p(y^i, x^i; w) = \sum_{i=1}^{l} \ln p(y^i \mid x^i; w) + \sum_{i=1}^{l} \ln p(x^i)$$
maximum likelihood maximizes
$$\sum_{i=1}^{l} \ln p(y^i \mid x^i; w)$$


derivative of the negative log-likelihood (the quantity that is minimized):
$$\frac{\partial}{\partial w_j}\left(-\sum_{i=1}^{l} \ln p(y^i \mid x^i; w)\right) = \sum_{i=1}^{l} \left(p(y = 1 \mid x^i; w) - y^i\right) \frac{\partial g(x^i; w)}{\partial w_j}$$
with
$$p(y = 1 \mid x; w) = \frac{1}{1 + e^{-g(x; w)}}$$
This is similar to the derivative of the quadratic loss function in regression: there the factor is $\left(g(x^i; w) - y^i\right)$ instead of $\left(p(y = 1 \mid x^i; w) - y^i\right)$.
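A minimal gradient-descent sketch for logistic regression with a linear discriminant g(x; w) = w^T x, using exactly the factor (p(y = 1 | x; w) − y^i) from the derivative above. The toy data, the bias handling, the learning rate, and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Gradient descent on the negative log-likelihood for g(x; w) = w^T x."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                  # p(y = 1 | x; w)
        grad = X.T @ (p - y) / len(y)       # factor (p - y) times dg/dw, averaged over the sample
        w -= lr * grad
    return w

# toy linearly separable data (made up for the example); last column acts as a bias input
rng = np.random.default_rng(2)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
print(fit_logistic(X, y))
```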

Statistical Learning Theory

• Does learning help for future tasks?

• Does a model that explains the training data also explain new data?

• Yes, if complexity is bounded

• VC-dimension as complexity measure

• statistical learning theory: bounds for the generalization error (on future data)

• bounds comprise training error and complexity

• structural risk minimization minimizes both terms simultaneously


• statistical learning theory:

-- (1) the uniform law of large numbers (empirical risk minimization)

-- (2) complexity constrained models (structural risk minimization)

• error bound on the mean squared error: bias-variance formulation

-- bias is training error = empirical risk

-- variance is model complexity: high complexity → more models → more solutions → large variance

Error Bounds for a Gaussian Classification Task

• We revisit the Gaussian classification task

$$R_{\min} = \int_X \min\{p(x, y = -1),\; p(x, y = 1)\}\, dx = \int_X \min\{p(x \mid y = -1)\, p(y = -1),\; p(x \mid y = 1)\, p(y = 1)\}\, dx$$
$$\forall\, a, b > 0:\ \forall\, 0 \leq \beta \leq 1:\quad \min\{a, b\} \leq a^{\beta}\, b^{1 - \beta}$$
$$\forall\, 0 \leq \beta \leq 1:\quad R_{\min} \leq (p(y = 1))^{\beta}\, (p(y = -1))^{1 - \beta} \int_X (p(x \mid y = 1))^{\beta}\, (p(x \mid y = -1))^{1 - \beta}\, dx$$


• Gaussian assumption:
$$\int_X (p(x \mid y = 1))^{\beta}\, (p(x \mid y = -1))^{1 - \beta}\, dx = \exp(-v(\beta))$$
where
$$v(\beta) = \frac{\beta(1 - \beta)}{2}\, (\mu_2 - \mu_1)^T \left(\beta\, \Sigma_1 + (1 - \beta)\, \Sigma_2\right)^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\left|\beta\, \Sigma_1 + (1 - \beta)\, \Sigma_2\right|}{|\Sigma_1|^{\beta}\, |\Sigma_2|^{1 - \beta}}$$
Chernoff bound: maximize $v(\beta)$ with respect to $\beta$.
Bhattacharyya bound: $\beta = \frac{1}{2}$ with
$$v(1/2) = \frac{1}{4} (\mu_2 - \mu_1)^T (\Sigma_1 + \Sigma_2)^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\left|\frac{\Sigma_1 + \Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\, |\Sigma_2|}}$$
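A short numerical sketch that evaluates v(1/2) from the Bhattacharyya bound above and the resulting bound on R_min for equal class priors; the example means and covariance matrices are made up.

```python
import numpy as np

def bhattacharyya_v(mu1, mu2, S1, S2):
    """v(1/2) for two Gaussian class-conditional densities (formula above)."""
    d = mu2 - mu1
    term1 = 0.25 * d @ np.linalg.solve(S1 + S2, d)
    term2 = 0.5 * np.log(np.linalg.det(0.5 * (S1 + S2))
                         / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2

# toy class-conditional Gaussians (parameters made up for the example)
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S1, S2 = np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])
v = bhattacharyya_v(mu1, mu2, S1, S2)
print("R_min <=", 0.5 * np.exp(-v))   # equal priors: (1/2)^beta (1/2)^(1-beta) = 1/2
```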

Empirical Risk Minimization

the empirical risk minimization (ERM) principle states:
if the training set is explained by the model, then the model generalizes to future examples
for this, the complexity of the model class must be restricted
empirical risk minimization (ERM): minimize the error on the training set

Complexity: Finite Number of Functions

• intuition why complexity matters

• complexity is just the number M of functions in model class

• difference between training error (empirical risk) and test error (risk)

empirical risk:
$$R_{emp}(g, Z) = \frac{1}{l} \sum_{i=1}^{l} L\left(y^i, g\left(x^i\right)\right)$$
finite set of functions $\{g_1, \ldots, g_M\}$
worst case (learning chooses an unknown function):
$$\max_{j=1,\ldots,M} \left| R_{emp}(g_j, l) - R(g_j) \right|$$


union bound: $p(a \text{ OR } b) \leq p(a) + p(b)$

distance of average and expectation: Chernoff inequality (for each $j$)
$$p(\mu_l - s > \epsilon) < \exp\left(-2\, \epsilon^2\, l\right)$$
where $\mu_l$ is the empirical mean of the true value $s$ for $l$ trials.

$$p\left(\max_{j=1,\ldots,M} \left|R_{emp}(g_j, l) - R(g_j)\right| > \epsilon\right) \leq \sum_{j=1}^{M} p\left(\left|R_{emp}(g_j, l) - R(g_j)\right| > \epsilon\right) \leq M\, 2\exp\left(-2\, \epsilon^2\, l\right) = 2\exp\left(\left(\frac{\ln M}{l} - 2\, \epsilon^2\right) l\right) = \delta$$

we obtain the complexity term
$$\epsilon(l, M, \delta) = \sqrt{\frac{\ln M + \ln(2/\delta)}{2\, l}}$$


Theorem (Finite Set Error Bound). With probability of at least $(1 - \delta)$ over possible training sets with $l$ elements and for $M$ possible functions we have
$$R(g) \leq R_{emp}(g, l) + \epsilon(l, M, \delta).$$
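For a feel of the theorem, the sketch below evaluates the complexity term ε(l, M, δ) for a few sample sizes; the chosen M, δ, and values of l are arbitrary illustration values.

```python
import numpy as np

def eps_finite(l, M, delta):
    """Complexity term of the finite-set bound: sqrt((ln M + ln(2/delta)) / (2 l))."""
    return np.sqrt((np.log(M) + np.log(2.0 / delta)) / (2.0 * l))

# illustrative values (made up): the term shrinks with l and grows only logarithmically with M
for l in (100, 1000, 10000):
    print(l, eps_finite(l, M=1000, delta=0.05))
```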


$\epsilon(l, M, \delta)$ should converge to zero as $l$ increases, therefore
$$\frac{\ln M}{l} \stackrel{l \to \infty}{\longrightarrow} 0, \qquad \epsilon(l, M, \delta) = \sqrt{\frac{\ln M + \ln(2/\delta)}{2\, l}}$$

Complexity: VC-Dimension

• we want to apply the previous bound to infinite function classes

• idea: on the training set only a finite number of functions are distinguishable

• example: all discriminant functions g giving the same classification function sign g(.)

• parametric models g(.;w) with parameter vector w

• Does the parameter that minimizes the empirical risk on the training set converge to the best solution as the training set grows?

• empirical risk minimization (ERM): consistent or not?

• do we select better models with larger training sets?


• parameter which minimizes the empirical risk for $l$ training examples:
$$w_l = \arg\min_w R_{emp}(g(.; w), l)$$
• ERM is consistent if (convergence in probability)
$$R(g(.; w_l)) \stackrel{l \to \infty}{\longrightarrow} \inf_w R(g(.; w)), \qquad R_{emp}(g(.; w_l), l) \stackrel{l \to \infty}{\longrightarrow} \inf_w R(g(.; w))$$
Empirical risk and expected risk converge to the minimal risk.


• ERM is strictly consistent if for all
$$\Lambda(c) = \left\{ w \;\middle|\; z = (x, y),\ \int L(y, g(x; w))\, p(z)\, dz \geq c \right\}$$
the convergence (in probability)
$$\inf_{w \in \Lambda(c)} R_{emp}(g(.; w), l) \stackrel{l \to \infty}{\longrightarrow} \inf_{w \in \Lambda(c)} R(g(.; w))$$
holds. Instead of "strictly consistent" we write "consistent".

• maximum likelihood is consistent for a set of densities with $0 < a \leq p(x; w) \leq A < \infty$ if, for the data-generating parameter $w_1$,
$$\inf_w \frac{1}{l} \sum_{i=1}^{l} \left(-\ln p(x^i; w)\right) \stackrel{l \to \infty}{\longrightarrow} \inf_w \int_X \left(-\ln p(x; w)\right) p(x; w_1)\, dx$$


• Under what conditions is the ERM consistent?

• New concepts and new capacity measures:
-- points to be shattered
-- annealed entropy
-- entropy (new definition)
-- growth function
-- VC-dimension

There are $2^l$ possibilities to label $l$ input data points $x^i$ with binary labels $y^i \in \{-1, 1\}$; realizing all of them is called shattering the input data.
complexity of a model class: the number of different labelings it realizes, i.e. how many points it can shatter


Note that each "x" is placed in a circle around its position, independently of the other "x" points; therefore each constellation represents a set with non-zero probability mass.


• number of points a function class can shatter: VC-dimension (later)

• function class $\mathcal{F}$

• shattering coefficient $N_{\mathcal{F}}(x^1, \ldots, x^l)$: number of labelings of $(x^1, \ldots, x^l)$ the class can realize

• entropy of a function class:
$$H_{\mathcal{F}}(l) = \mathrm{E}_{(x^1, \ldots, x^l)} \ln N_{\mathcal{F}}(x^1, \ldots, x^l)$$

• annealed entropy of a function class:
$$H^{ann}_{\mathcal{F}}(l) = \ln \mathrm{E}_{(x^1, \ldots, x^l)} N_{\mathcal{F}}(x^1, \ldots, x^l)$$

• growth function of a function class:
$$G_{\mathcal{F}}(l) = \ln \sup_{(x^1, \ldots, x^l)} N_{\mathcal{F}}(x^1, \ldots, x^l)$$

By Jensen's inequality and the supremum:
$$H_{\mathcal{F}}(l) \leq H^{ann}_{\mathcal{F}}(l) \leq G_{\mathcal{F}}(l)$$
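To make the shattering coefficient concrete, the following sketch counts by brute force the labelings that one-dimensional threshold functions sign(x − t) can realize on a sample; this function class and the sample are assumptions chosen for illustration (for this class N_F = l + 1 on distinct points, hence G_F(l) = ln(l + 1) and d_VC = 1).

```python
import numpy as np

def n_labelings(x):
    """Shattering coefficient N_F(x^1, ..., x^l) for threshold functions sign(x - t)."""
    xs = np.sort(x)
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    labelings = {tuple(np.sign(x - t).astype(int)) for t in thresholds}
    return len(labelings)

x = np.array([0.3, -1.2, 2.5, 0.9])        # l = 4 sample points (made up)
print(n_labelings(x), "of", 2 ** len(x))   # l + 1 = 5 of 2^l = 16 possible labelings
```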


Theorem (Sufficient Condition for Consistency of ERM). If
$$\lim_{l \to \infty} \frac{H_{\mathcal{F}}(l)}{l} = 0$$
then ERM is consistent.

ERM has a fast rate of convergence (exponential convergence) if
$$p\left(\sup_w \left|R(g(.; w)) - R_{emp}(g(.; w_l), l)\right| > \epsilon\right) < b \exp\left(-c\, \epsilon^2\, l\right)$$

Theorem (Sufficient Condition for Fast Rate). If
$$\lim_{l \to \infty} \frac{H^{ann}_{\mathcal{F}}(l)}{l} = 0$$
then ERM has a fast rate of convergence.


• the theorems above are valid for a given probability measure on the observations; the probability measure enters the formulas via the expectation

Theorem (Consistency of ERM for Any Probability). The condition
$$\lim_{l \to \infty} \frac{G_{\mathcal{F}}(l)}{l} = 0$$
is necessary and sufficient for the ERM to be consistent and also ensures a fast rate of convergence.


• The VC (Vapnik-Chervonenkis) dimension is the largest integer $l$ for which $G_{\mathcal{F}}(l) = l \ln 2$ holds:
$$d_{VC} = \max_l \{ l \mid G_{\mathcal{F}}(l) = l \ln 2 \}.$$
If the maximum does not exist: $d_{VC} = \infty$.

• The VC-dimension is the maximum number of vectors that can be shattered by the function class.


Theorem (VC-Dimension Bounds the Growth Function). The growth function is bounded by
$$G_{\mathcal{F}}(l) \begin{cases} = l \ln 2 & \text{if } l \leq d_{VC} \\ \leq d_{VC} \left(1 + \ln \frac{l}{d_{VC}}\right) & \text{if } l > d_{VC} \end{cases}$$
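A quick numerical look at the theorem: the bound grows like l ln 2 up to the VC-dimension and only logarithmically beyond it; the chosen d_VC and sample sizes are arbitrary illustration values.

```python
import numpy as np

def growth_bound(l, d_vc):
    """Bound from the theorem: l ln 2 for l <= d_VC, else d_VC (1 + ln(l / d_VC))."""
    return l * np.log(2.0) if l <= d_vc else d_vc * (1.0 + np.log(l / d_vc))

d_vc = 10
for l in (5, 10, 20, 100, 1000):
    print(l, round(growth_bound(l, d_vc), 2), round(l * np.log(2.0), 2))
```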


• a function class with finite VC-dimension is consistent and converges fast

-- Linear functions on a d-dimensional input space:
$$g(x; w) = w^T x: \ d_{VC} = d, \qquad g(x; w) = w^T x + b: \ d_{VC} = d + 1$$

-- Nondecreasing nonlinear one-dimensional functions:
$$\sum_{i=1}^{k} \left|a_i\, x^i\right| \operatorname{sign}(x) + a_0: \ d_{VC} = 1$$

-- Nonlinear one-dimensional functions:
$$\sin(w\, z): \ d_{VC} = \infty$$


-- Neural networks:
$$d_{VC} \leq 2\, W \log_2(e\, M)$$
where M is the number of units, W is the number of weights, and e is the base of the natural logarithm (Baum & Haussler 89, Shawe-Taylor & Anthony 91).
For inputs restricted to $[-D; D]$ (Bartlett & Williamson 1996):
$$d_{VC} \leq 2\, W \log_2(24\, e\, W\, D)$$

Error Bounds

• idea for deriving the error bounds: use the set of distinguishable functions, whose cardinality is given by $N_{\mathcal{F}}$

• trick of two half-samples and their difference ("symmetrization"); therefore, in the following, $2l$ examples are used for the complexity definition and $l$ for the empirical error

• minimal possible risk:
$$w_0 = \arg\min_w R(g(.; w)), \qquad R_{\min} = \min_w R(g(.; w)) = R(g(.; w_0))$$

symmetrization inequality:
$$p\left(\sup_w \left|\frac{1}{l}\sum_{i=1}^{l} L\left(y^i, g\left(x^i; w\right)\right) - \frac{1}{l}\sum_{i=l+1}^{2l} L\left(y^i, g\left(x^i; w\right)\right)\right| > \epsilon - \frac{1}{l}\right) \geq \frac{1}{2}\, p\left(\sup_w \left|\frac{1}{l}\sum_{i=1}^{l} L\left(y^i, g\left(x^i; w\right)\right) - R(g(.; w))\right| > \epsilon\right)$$


Theorem (Error Bound). With probability of at least $(1 - \delta)$ over possible training sets with $l$ elements, for the parameter $w_l$ which minimizes the empirical risk we have
$$R(g(.; w_l)) \leq R_{emp}(g(.; w_l), l) + \sqrt{\epsilon(l, \delta)}.$$
With probability of at least $(1 - 2\delta)$ the difference between the optimal risk and the risk of $w_l$ is bounded by
$$R(g(.; w_l)) - R_{\min} < \sqrt{\epsilon(l, \delta)} + \sqrt{\frac{-\ln \delta}{l}}.$$
Here $\epsilon(l, \delta)$ can be defined for a specific probability as
$$\epsilon(l, \delta) = \frac{8}{l}\left(H^{ann}_{\mathcal{F}}(2l) + \ln(4/\delta)\right)$$
or for any probability as
$$\epsilon(l, \delta) = \frac{8}{l}\left(G_{\mathcal{F}}(2l) + \ln(4/\delta)\right),$$
where the latter can be expressed through the VC-dimension $d_{VC}$ as
$$\epsilon(l, \delta) = \frac{8}{l}\left(d_{VC}\left(\ln(2l/d_{VC}) + 1\right) + \ln(4/\delta)\right).$$
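The sketch below plugs numbers into the VC-dimension version of ε(l, δ) and the bound R ≤ R_emp + √ε(l, δ); the training error, d_VC, δ, and sample sizes are made-up illustration values.

```python
import numpy as np

def vc_error_bound(train_err, l, d_vc, delta=0.05):
    """R <= R_emp + sqrt(eps), eps = 8/l (d_VC (ln(2l/d_VC) + 1) + ln(4/delta))."""
    eps = 8.0 / l * (d_vc * (np.log(2.0 * l / d_vc) + 1.0) + np.log(4.0 / delta))
    return train_err + np.sqrt(eps)

# illustrative numbers (made up): the guarantee becomes non-trivial only for large l / d_VC
for l in (1_000, 100_000, 10_000_000):
    print(l, vc_error_bound(train_err=0.05, l=l, d_vc=10))
```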


• the complexity measure depends on the ratio $\frac{d_{VC}}{l}$

• The bound above is from Anthony and Bartlett, whereas an older bound from Vapnik is
$$R(g(.; w_l)) \leq R_{emp}(g(.; w_l), l) + \frac{\epsilon(l, \delta)}{2}\left(1 + \sqrt{1 + \frac{R_{emp}(g(.; w_l), l)}{\epsilon(l, \delta)}}\right)$$

• the complexity term decreases with $\frac{1}{\sqrt{l}}$

• for zero empirical risk the bound on the risk decreases with $\frac{1}{\sqrt{l}}$

• Later: the expected risk decreases with $\frac{1}{l}$


• bound on the risk:
$$R \leq R_{emp} + \text{complexity}$$
• the bound is similar to the bias-variance formulation
-- bias corresponds to the empirical risk
-- variance corresponds to the complexity


• In many practical cases the bound is not useful because it is not tight

• However, in many practical cases the minimum of the bound is close to the minimum of the test error


• regression: instead of the shattering coefficient, the covering number $\mathcal{N}(\varepsilon, \mathcal{F}, X)$ is used (a covering of the functions with distance $\varepsilon$)

• the growth function is then:
$$G(\varepsilon, \mathcal{F}, l) = \ln \sup_X \mathcal{N}(\varepsilon, \mathcal{F}, X)$$

bound on the generalization error:
$$R(g(.; w_l)) \leq R_{emp}(g(.; w_l), l) + \sqrt{\epsilon(\varepsilon, l, \delta)}$$
where
$$\epsilon(\varepsilon, l, \delta) = \frac{36}{l}\left(\ln(12\, l) + G(\varepsilon/6, \mathcal{F}, l) - \ln \delta\right)$$

Structural Risk Minimization

The Structural Risk Minimization (SRM) principle minimizes the guaranteed risk, that is, a bound on the risk, instead of the empirical risk alone.


• nested set of function classes
$$\mathcal{F}_1 \subset \mathcal{F}_2 \subset \ldots \subset \mathcal{F}_n \subset \ldots$$
where class $\mathcal{F}_n$ possesses VC-dimension $d^n_{VC}$ and
$$d^1_{VC} \leq d^2_{VC} \leq \ldots \leq d^n_{VC} \leq \ldots$$
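A minimal sketch of the SRM idea under several assumptions: the nested classes are taken as the sign of polynomials of increasing degree, their VC-dimension is taken as degree + 1, and the guaranteed risk is approximated by the training error plus the VC complexity term from the error-bound theorem above; the data and all constants are made up for illustration.

```python
import numpy as np

def vc_eps(l, d_vc, delta=0.05):
    """VC complexity term from the error-bound theorem above."""
    return 8.0 / l * (d_vc * (np.log(2.0 * l / d_vc) + 1.0) + np.log(4.0 / delta))

# toy 1-D classification data (made up): the label flips several times over the input range
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 200)
y = np.sign(np.sin(3 * x) + 0.3 * rng.normal(size=x.size))

def train_error(degree):
    # nested classes F_1 ⊂ F_2 ⊂ ...: sign of polynomials of increasing degree;
    # assumption: their VC-dimension is taken as degree + 1
    coeffs = np.polyfit(x, y, degree)          # least-squares fit to the ±1 labels
    return np.mean(np.sign(np.polyval(coeffs, x)) != y)

for degree in range(1, 8):
    emp = train_error(degree)
    guaranteed = emp + np.sqrt(vc_eps(len(x), d_vc=degree + 1))
    print(degree, round(emp, 3), round(guaranteed, 3))
# SRM picks the degree with the smallest guaranteed risk, not the smallest training error
```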


• Example for SRM: minimum description length
-- the sender transmits a model (once) and then the inputs and the errors
-- the receiver has to recover the labels
-- goal: minimize the transmission costs (description length)
$$\text{transmission costs} = R_{emp} + \text{complexity}$$

• Is the SRM principle consistent? How fast does it converge?

SRM is consistent!

asymptotic rate of convergence:
$$p\left(\lim_{l \to \infty} \sup\, r^{-1}(l)\left|R\left(g\left(.\,; w^{\mathcal{F}_n}_l\right)\right) - R_{\min}\right| < \infty\right) = 1$$
with
$$r(l) = \left|R^n_{\min} - R_{\min}\right| + \sqrt{\frac{d^n_{VC} \ln l}{l}},$$
where $R^n_{\min}$ is the minimal risk of the function class $\mathcal{F}_n$.


If the optimal solution belongs to some class $\mathcal{F}_n$, then $\left|R^n_{\min} - R_{\min}\right| \stackrel{l \to \infty}{\longrightarrow} 0$ and the convergence rate is
$$r(l) = O\left(\sqrt{\frac{\ln l}{l}}\right)$$

Margin as Complexity Measure

• the VC-dimension can be bounded via restrictions on the class of functions

• most famous restriction: the zero isoline of the discriminant function has minimal distance $\gamma$ (margin) to all training data points, which are contained in a sphere with radius R


• linear discriminant functions: $w^T x + b$

• classification function: $\operatorname{sign}\left(w^T x + b\right)$

• scaling $w$ and $b$ does not change the classification function

• for each classification function one representative discriminant function is chosen

• canonical form w.r.t. the training data $X$:

$$\min_{i=1,\dots,l} \left| w^T x_i + b \right| \;=\; 1$$
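A minimal sketch of this normalization (not from the slides; function name and toy data are made up for illustration): divide $w$ and $b$ by $\min_i |w^T x_i + b|$.

```python
import numpy as np

def canonical_form(w, b, X):
    """Rescale (w, b) so that min_i |w^T x_i + b| = 1 (canonical form w.r.t. X)."""
    s = np.min(np.abs(X @ w + b))  # smallest absolute discriminant value on the training data
    return w / s, b / s

# Hypothetical toy data: each row of X is a training point.
X = np.array([[2.0, 1.0], [3.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
w, b = np.array([1.0, 1.0]), 0.5
w_c, b_c = canonical_form(w, b, X)
print(np.min(np.abs(X @ w_c + b_c)))  # prints 1.0 by construction
```

Since $w$ and $b$ are only rescaled, the classification function $\operatorname{sign}(w^T x + b)$ is unchanged; the canonical form merely fixes one representative discriminant function.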


Theorem 1 (Margin bounds VC-dimension)
The class of classification functions $\operatorname{sign}\left(w^T x + b\right)$, where the discriminant function $w^T x + b$ is in its canonical form with respect to $X$, which is contained in a sphere of radius $R$, and where $\|w\| \leq \frac{1}{\gamma}$, satisfies

$$d_{\mathrm{VC}} \;\leq\; \frac{R^2}{\gamma^2}.$$
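To make the bound concrete, a small sketch (my illustration, not the slides' code), assuming $w$ is already in canonical form so that $\gamma = 1/\|w\|$, and taking $R$ as the radius of the smallest origin-centered sphere containing the data:

```python
import numpy as np

def vc_dimension_bound(w_canonical, X):
    """Evaluate the bound R^2 / gamma^2 for a canonical hyperplane on data X."""
    gamma = 1.0 / np.linalg.norm(w_canonical)   # margin of the canonical form
    R = np.max(np.linalg.norm(X, axis=1))       # data lie in a sphere of radius R around the origin
    return R**2 / gamma**2                      # equivalently R^2 * ||w||^2

# Hypothetical toy example
X = np.array([[2.0, 1.0], [3.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
w_c = np.array([0.4, 0.4])                      # a weight vector assumed to be in canonical form
print(vc_dimension_bound(w_c, X))
```

A large margin $\gamma$ relative to the data radius $R$ thus gives a small VC-dimension bound.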


If at least one data point exists for which the discriminant function is positive and at least one for which it is negative, then we can optimize $b$ and rescale in order to obtain the smallest $\|w\|$.

This gives the tightest bound and the smallest VC-dimension.

After optimizing $b$ and rescaling there exist points $x_1$ and $x_2$ for which

$$w^T x_1 + b = 1 \quad \text{and} \quad w^T x_2 + b = -1.$$
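A minimal sketch of this step (my own illustration, assuming a fixed direction $w$ and linearly separable toy data): place $b$ midway between the closest positive and closest negative score, then rescale so that these two closest points attain the values $+1$ and $-1$.

```python
import numpy as np

def optimize_b_and_rescale(w, X, y):
    """For a fixed direction w, choose b and a scaling so that the closest positive
    point obtains w^T x + b = 1 and the closest negative point obtains -1."""
    scores = X @ w
    p = np.min(scores[y == +1])   # score of the closest positive point
    n = np.max(scores[y == -1])   # score of the closest negative point
    b = -(p + n) / 2.0            # boundary midway between the two closest points
    scale = (p - n) / 2.0         # half the functional gap; positive for separable data
    return w / scale, b / scale

# Hypothetical separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([+1, +1, -1, -1])
w_s, b_s = optimize_b_and_rescale(np.array([1.0, 1.0]), X, y)
print(X @ w_s + b_s)  # the closest point of each class now maps to +1 and -1
```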


After this optimization, the distance of $x_1$ and $x_2$ to the boundary function is

$$\frac{1}{\|w^*\|} \;=\; \gamma.$$

Theorem 2 (Margin Error Bound)
The classification functions $\operatorname{sign}\left(w^T x + b\right)$ are restricted to $\|w\| \leq \frac{1}{\gamma}$ and $\|x\| < R$. Let $\nu$ be the fraction of training examples which have a margin (distance to the hyperplane $w^T x + b = 0$) smaller than $\frac{\rho}{\|w\|}$. With probability at least $1 - \delta$ over the draw of $l$ examples, the probability of misclassifying a new example is bounded from above by

$$\nu \;+\; \sqrt{\frac{c}{l}\left(\frac{R^2}{\rho^2\,\gamma^2}\,\ln^2 l \;+\; \ln\frac{1}{\delta}\right)},$$

where $c$ is a constant.
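As a sketch of how the right-hand side is evaluated (my illustration; the constant $c$ and all inputs are assumed values, since the theorem leaves $c$ unspecified):

```python
import numpy as np

def margin_error_bound(scores, y, w, R, rho, gamma, delta, c=1.0):
    """Evaluate nu + sqrt(c/l * (R^2/(rho^2 gamma^2) * ln^2(l) + ln(1/delta)))
    for training scores w^T x_i + b with labels y in {-1, +1}."""
    l = len(y)
    distances = y * scores / np.linalg.norm(w)          # signed distances to the hyperplane
    nu = np.mean(distances < rho / np.linalg.norm(w))   # fraction of margin violations
    complexity = (R**2 / (rho**2 * gamma**2)) * np.log(l)**2 + np.log(1.0 / delta)
    return nu + np.sqrt(c / l * complexity)

# Hypothetical toy evaluation
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w, b = np.array([1.0, 0.5]), 0.1
y = np.sign(X @ w + b + 0.3 * rng.normal(size=200))
R = np.max(np.linalg.norm(X, axis=1))
print(margin_error_bound(X @ w + b, y, w, R=R, rho=0.5,
                         gamma=1.0 / np.linalg.norm(w), delta=0.05))
```

The trade-off is visible directly in the formula: a larger margin parameter $\rho$ shrinks the complexity term but typically increases the fraction of margin violations $\nu$.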