03. PDF Estimation Corr


  • Slide 1: 3. Supervised estimation of probability density functions

    Dipartimento di Ingegneria Biofisica ed Elettronica, Università di Genova
    Prof. Sebastiano B. Serpico

  • Slide 2: Supervised Classifier Design

    Approach 1: from the training set (training samples for the classes {ωi}), estimate the class-conditional pdfs {p(x|ωi)}; use decision theory to define the decision rule; apply the decision rule to the data set; obtain the data-set classification.

    Approach 2: from the training set (training samples for the classes {ωi}), train a non-Bayesian classifier by a direct use of the training samples; apply the resulting decision rule to the data set; obtain the data-set classification.

  • Slide 3: Supervised Estimation of a pdf

    The use of decision theory to design classifiers requires a preliminary estimate of the class-conditional pdfs. In a supervised approach, the estimation of the pdf p(x|ωi) can be performed on the basis of the training data of the class ωi.

    Problem definition and notation:

    Consider a feature vector x with (unknown) pdf p(x) and a finite set X = {x1, x2, ..., xN} of N independent samples drawn from this pdf.

    We would like to compute, on the basis of the available samples, an estimated pdf p̂(x).

    In order to perform supervised classification, the estimation process has to be repeated individually for each class: in particular, to estimate p(x|ωi), we assume that the set X is the set of training samples of the class ωi.

  • Slide 4: Approaches to pdf Estimation

    Parametric estimation:
      a given model (e.g., Gaussian, exponential, ...) is assumed for the analytical form of p(x);
      the parameters of such a model are estimated.

    Remarks:
      a given model may not be physically realistic;
      most parametric methods assume single-mode pdfs, while many real problems involve multimodal pdfs;
      complex methods (not considered here) have been developed to identify the different modes of a pdf.

    Non-parametric estimation:
      no analytical model is assumed for the pdf; p(x) is estimated directly from the samples in X.

    Remarks:
      typically, the lack of predefined models allows more flexibility;
      however, the computational complexity of the estimation problem is generally higher than in the parametric case.

  • Slide 5: Parametric Estimation

    Given an analytical model of the pdf p(x) to be estimated, the parameters that characterize the model are collected into a vector θ (of dimension r).

    We highlight the dependence on the parameters by adopting the notation p(x|θ); in particular, p(X|θ), considered as a function of θ, is called the likelihood function.

    The training samples x1, x2, ..., xN are collected into a single observation vector X. The samples are regarded as random vectors, and a pdf p(X|θ) is associated with them.

    Usually the samples are assumed to be identically distributed (because they are all drawn from the same pdf p(x)) and independent of each other (i.i.d., independent and identically distributed), so that:

    p(X|θ) = ∏_{k=1}^{N} p(x_k|θ)
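    As an illustration (added here, not part of the original slides), a minimal Python sketch of the i.i.d. factorization: the log of the product over the samples becomes a sum of per-sample log-pdf terms. A one-dimensional Gaussian model with parameters θ = (m, σ²) is assumed; the function name is ours.

```python
import numpy as np

def gaussian_log_likelihood(samples, m, var):
    """ln p(X|theta) for i.i.d. samples under a 1-D Gaussian model with
    theta = (m, var): the product over samples becomes a sum of logs."""
    samples = np.asarray(samples, dtype=float)
    return np.sum(-0.5 * np.log(2.0 * np.pi * var)
                  - (samples - m) ** 2 / (2.0 * var))

X = np.array([0.2, -0.1, 0.4, 0.0, 0.3])                  # N = 5 training samples
print(gaussian_log_likelihood(X, m=0.0, var=1.0))
print(gaussian_log_likelihood(X, m=X.mean(), var=1.0))    # the sample mean scores higher
```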

  • Slide 6: General Properties of the Estimates

    General properties

    The estimate of the parameter vector depends on the observation vector:

    θ̂ = θ̂(X)

    Therefore the estimate θ̂ is itself a random vector.

    Bias

    The estimation error is ε = ε(X, θ) = θ̂ − θ. Its expected value E{ε} is called the bias. The estimate is said to be unbiased if, for every parameter vector θ, the estimation error has zero mean:

    E{ε} = 0,   or equivalently   E{θ̂} = θ

    To claim that the estimate of the parameter θ_i (i = 1, 2, ..., r) is good, we want the estimation error ε_i (i-th component of ε) to have zero mean (i.e., the estimate is unbiased), but also to have a small variance var{ε_i}.

  • Slide 7: Variance of the Estimation Error

    Cramér-Rao inequality

    For every unbiased estimate of the vector θ, it holds that:

    var{ε_i} ≥ [J⁻¹(θ)]_ii,   i = 1, 2, ..., r

    where J(θ) = E{∇_θ ln p(X|θ) · ∇_θ ln p(X|θ)^t} is the Fisher information matrix, with elements:

    [J(θ)]_ij = E{ ∂ ln p(X|θ)/∂θ_i · ∂ ln p(X|θ)/∂θ_j }

    The Cramér-Rao inequality provides a lower bound on the variance of the estimation error: var{ε_i} cannot be made arbitrarily small, being always lower bounded by [J⁻¹(θ)]_ii. The Fisher information matrix therefore measures how "good" an estimate can be.

    In particular, an unbiased estimate that satisfies the equality for every value of the parameter vector is said to be efficient.
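    A quick numeric illustration (ours, under the assumption of N i.i.d. samples from a 1-D Gaussian with known variance σ²): the Fisher information for the mean is J = N/σ², so the Cramér-Rao bound for any unbiased estimate of the mean is σ²/N, and the sample mean attains it. The Monte Carlo sketch below checks this.

```python
import numpy as np

rng = np.random.default_rng(0)
m_true, sigma, N, trials = 2.0, 1.5, 50, 20000

# Empirical variance of the sample-mean estimator over many repeated experiments.
estimates = rng.normal(m_true, sigma, size=(trials, N)).mean(axis=1)
print("empirical var of sample mean:", estimates.var())
print("Cramer-Rao bound sigma^2/N  :", sigma**2 / N)   # the two should be close
```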

  • Slide 8: Asymptotic Properties of Estimates

    Often biased and/or non-efficient estimates are used, provided they exhibit a good behavior for large values of N.

    An estimate is called asymptotically unbiased if the error mean goes to zero for N → +∞:

    lim_{N→+∞} E{ε} = 0,   that is   lim_{N→+∞} E{θ̂} = θ

    An estimate is said to be asymptotically efficient if the error variance reaches the Cramér-Rao lower bound for N → +∞:

    lim_{N→+∞} var{ε_i} / [J⁻¹(θ)]_ii = 1,   i = 1, 2, ..., r

    An estimate is said to be consistent if it converges to the true value in probability for N → +∞:

    lim_{N→+∞} P{‖θ̂ − θ‖ < δ} = 1   for every δ > 0

    A sufficient condition for an estimate to be consistent is that it is asymptotically unbiased and that the estimation error has vanishing variance for N → +∞ [Mendel, 1987].

  • Slide 9: ML Estimation

    Definition

    The Maximum Likelihood (ML) estimate of the vector θ is defined as:

    θ̂ = arg max_θ p(X|θ)

    Remarks

    For different values of θ, different pdfs are obtained; each of them is evaluated at the observations X. The pdf assuming the maximum value at X is identified: the ML estimate is the value of θ that produces this pdf.

    It is often convenient not to maximize the likelihood function p(X|θ) directly, but (equivalently) the log-likelihood function:

    θ̂ = arg max_θ ln p(X|θ)
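    The sketch below (ours) maximizes the Gaussian log-likelihood numerically with scipy and compares the result with the closed-form ML estimates (the sample mean and the standard deviation with denominator N); parameterizing via log σ is just a convenience to keep σ > 0.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, scale=2.0, size=200)   # i.i.d. training samples

def neg_log_likelihood(theta):
    """Negative Gaussian log-likelihood; theta = (m, log_sigma)."""
    m, log_sigma = theta
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + (X - m) ** 2 / sigma2)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
m_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(m_hat, sigma_hat)          # numerical ML estimate
print(X.mean(), X.std())         # closed-form ML estimate (denominator N)
```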

  • Slide 10: ML Estimation: Example

    ML estimation of the mean of a one-dimensional Gaussian with known variance (equal to one), starting from a single observed sample x0.

    [Figure: Gaussian likelihoods for different candidate means plotted along the x axis; the likelihood is maximized by the curve centered on the observation, so m̂ = x0.]

  • Slide 11: Properties of the ML Estimation

    Under mild assumptions on the function p(X|θ), it can be proven that, if an efficient estimate exists and the ML estimate is unbiased, then the efficient estimate coincides with the ML estimate.

    Even when an efficient estimate does not exist, the ML estimate exhibits good asymptotic properties. In particular, the ML estimate is:
      asymptotically unbiased;
      asymptotically efficient;
      consistent;
      asymptotically Gaussian, with mean θ and covariance related to the inverse Fisher information J⁻¹(θ).

    These properties explain the wide diffusion of ML estimators in classification methods.

  • Slide 12: ML Estimation for the Gaussian Case (content not preserved in the source; from the following slides, it contained the ML estimates of the parameters of a Gaussian pdf, i.e. the sample mean and the covariance estimate with denominator N):

    m̂ = (1/N) ∑_{k=1}^{N} x_k,   Σ̂ = (1/N) ∑_{k=1}^{N} (x_k − m̂)(x_k − m̂)^t

  • Slide 13: Properties of the Parametric Gaussian Estimation

    The estimates of m and Σ, being ML estimates, are asymptotically unbiased, asymptotically efficient and consistent. Moreover, the following additional properties hold.

    The estimate of m is unbiased, while the estimate of Σ is biased:

    E{m̂} = m,   E{Σ̂} = (N − 1)/N · Σ

    Therefore the estimate of Σ is usually modified as follows:

    Σ̂ = 1/(N − 1) ∑_{k=1}^{N} (x_k − m̂)(x_k − m̂)^t

    The two estimates coincide for N → +∞ (consistently with the fact that ML estimates are asymptotically unbiased).

    The estimates introduced for the mean and the covariance matrix are generally called sample mean and sample covariance.
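    A minimal numpy sketch (ours) of the sample mean and of the two covariance estimates, with denominator N (the ML, biased one) and with denominator N − 1; the data reuse the binary training samples of class ω1 from the example on slide 15.

```python
import numpy as np

X = np.array([[0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0]])          # N samples as rows (n features each)

m_hat = X.mean(axis=0)                    # sample mean

D = X - m_hat                             # deviations from the sample mean
cov_ml       = D.T @ D / len(X)           # ML estimate, denominator N (biased)
cov_unbiased = D.T @ D / (len(X) - 1)     # corrected estimate, denominator N-1

print(m_hat)
print(cov_ml)                             # equals np.cov(X, rowvar=False, bias=True)
print(cov_unbiased)                       # equals np.cov(X, rowvar=False)
```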

  • Slide 14: Iterative Expressions of the Gaussian Parametric Estimation

    The estimates of m and Σ can also be expressed in an iterative form, processing each sample in sequence instead of the whole training set at once.

    Iterative computation of the sample mean:

    m̂^(k) = (1/k) ∑_{h=1}^{k} x_h,   m̂^(1) = x_1,
    m̂^(k+1) = [k m̂^(k) + x_{k+1}] / (k + 1) = m̂^(k) + [x_{k+1} − m̂^(k)] / (k + 1),   k = 1, 2, ..., N − 1,
    m̂ = m̂^(N)

    Iterative computation of the sample covariance (the iterative estimate of Σ refers to the expression with denominator N):

    S^(k) = (1/k) ∑_{h=1}^{k} x_h x_h^t,   S^(1) = x_1 x_1^t,
    S^(k+1) = S^(k) + [x_{k+1} x_{k+1}^t − S^(k)] / (k + 1),   k = 1, 2, ..., N − 1,
    Σ̂ = S^(N) − m̂ m̂^t
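    A sketch (ours) of the iterative computation, checked against the batch formulas; it follows the recursions above, with denominator N for the covariance.

```python
import numpy as np

def iterative_mean_cov(samples):
    """One-sample-at-a-time computation of the sample mean and of the
    covariance estimate with denominator N, following the recursions above."""
    x1 = np.asarray(samples[0], dtype=float)
    m = x1.copy()                      # m^(1) = x_1
    S = np.outer(x1, x1)               # S^(1) = x_1 x_1^t
    for k, xk in enumerate(samples[1:], start=1):
        xk = np.asarray(xk, dtype=float)
        m = m + (xk - m) / (k + 1)                     # m^(k+1)
        S = S + (np.outer(xk, xk) - S) / (k + 1)       # S^(k+1)
    return m, S - np.outer(m, m)       # Sigma_hat = S^(N) - m m^t

X = np.random.default_rng(0).normal(size=(100, 3))
m_it, cov_it = iterative_mean_cov(X)
print(np.allclose(m_it, X.mean(axis=0)))                        # True
print(np.allclose(cov_it, np.cov(X, rowvar=False, bias=True)))  # True
```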

  • Slide 15: Example (1/2)

    Given n = 3 features and two classes ω1 and ω2, characterized by the following training sets:

    ω1: (0, 0, 0), (1, 1, 0), (1, 0, 0), (1, 0, 1)
    ω2: (0, 0, 1), (0, 1, 1), (1, 1, 1), (0, 1, 0)

    In this case it makes no sense to normalize the features: the features are binary and the samples are represented by all possible combinations of the three binary features.

    We assume the class-conditional pdfs are Gaussian and we apply ML estimation, so we need to estimate the mean vector and the covariance matrix of each class. Let us see the computation for the class ω1. The estimated mean is:

    m̂_1 = (1/4) [ (0, 0, 0)^t + (1, 1, 0)^t + (1, 0, 0)^t + (1, 0, 1)^t ] = (3/4, 1/4, 1/4)^t

  • Slide 16: Example (2/2)

    Estimation of the covariance matrix for ω1. Let us use N = 4 as the denominator, thus obtaining a biased estimate. Using N − 1 = 3 would give an unbiased estimate; however, for large N (e.g., N > 30), 1/(N − 1) ≈ 1/N.

    Σ̂_1 = (1/4) ∑_{k=1}^{4} (x_k − m̂_1)(x_k − m̂_1)^t
        = (1/4) · (1/16) [[12, 4, 4], [4, 12, −4], [4, −4, 12]]
        = (1/16) [[3, 1, 1], [1, 3, −1], [1, −1, 3]]

    In this case (not in general!) we obtain Σ̂_2 = Σ̂_1.
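    The example can be verified numerically; the following sketch (ours) reproduces the estimated means and the matrix (1/16)[[3, 1, 1], [1, 3, −1], [1, −1, 3]] for both classes.

```python
import numpy as np

# Training samples of the two classes from the example (3 binary features).
X1 = np.array([[0, 0, 0], [1, 1, 0], [1, 0, 0], [1, 0, 1]], dtype=float)
X2 = np.array([[0, 0, 1], [0, 1, 1], [1, 1, 1], [0, 1, 0]], dtype=float)

for name, X in (("omega_1", X1), ("omega_2", X2)):
    m = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True)   # denominator N = 4 (ML / biased)
    print(name, "mean:", m)
    print(16 * cov)                            # [[3, 1, 1], [1, 3, -1], [1, -1, 3]]
```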

  • Slide 17: Non-Gaussian Parametric Estimation

    When a Gaussian model appears inaccurate for the problem at hand, other parametric models can be adopted.

    For n > 1, an extension of the Gaussian model is given by the elliptically contoured pdfs:

    p(x) = |Σ|^{−1/2} f[ (x − m)^t Σ^{−1} (x − m) ]

    where m = E{x}, Σ = Cov{x} and f is an appropriate non-negative function. The level curves of such pdfs are hyperellipses, as in the Gaussian case.

    For n = 1, very general models are the Pearson pdfs which, as their parameters vary, include uniform pdfs, Gaussian pdfs and also impulsive models with vertical asymptotes.

  • Slide 18: Non-Parametric Estimation: Problem Definition (1/2)

    In a non-parametric context, the estimation of the unknown pdf p(x) is not constrained to satisfy any predefined model and is built directly from the training samples x1, x2, ..., xN (assumed i.i.d.).

    Let x* be a generic point and R a predefined region of the feature space containing x*. Assuming that the true pdf p(x) is a continuous function and that R is small enough that p(x) does not vary significantly inside R, we have:

    P_R = P{x ∈ R} = ∫_R p(x) dx ≈ p(x*) V

    where V is the n-dimensional volume (measure) of R.

    If K is the number of training samples belonging to R (out of a total of N training samples), a consistent estimate of the probability P_R is the relative frequency:

    P̂_R = K/N,   with   lim_{N→+∞} P{|P̂_R − P_R| < δ} = 1 for every δ > 0   (law of large numbers)

  • Slide 19: Non-Parametric Estimation: Problem Definition (2/2)

    Pdf estimation

    From the estimate of the probability P_R that a sample belongs to R, we can derive an estimate of the pdf at the point x*:

    p̂(x*) = P̂_R / V = K / (N V)

    Remarks

    R has to be large enough to contain a number of training samples that justifies the application of the law of large numbers;
    R has to be small enough to justify the hypothesis that p(x) does not vary significantly inside R.

    To obtain an accurate estimate, a compromise between these two needs is therefore necessary. However, a good compromise cannot be reached if the total number N of samples in the training set is small.

  • Slide 20: Two Non-Parametric Approaches

    By exchanging the roles of the quantities K and V, the above reasoning leads to two possible approaches to non-parametric estimation:

    k-nearest-neighbor approach: for a fixed K and a given point x of the feature space, the region R is taken as the one containing the K training samples nearest to x; the corresponding hypervolume V is computed and the pdf estimate is deduced;

    Parzen-window approach: for a fixed region R centered at x, with hypervolume V, K is computed by counting the training samples falling in R, and the estimate is then derived.

    Both approaches can be proven to lead to consistent estimates. However, it is not possible to draw general conclusions about their behavior in a real context, characterized by a finite number of training samples.

  • Slide 21: K-Nearest-Neighbor Estimation

    Hypotheses

    The number of training samples K is preset.
    A reference cell (e.g., a sphere) centered at x* is considered.

    Methodology

    The k-nearest-neighbor (k-nn) estimator expands the cell until it contains exactly K training samples; V_K(x*) is the volume of the resulting cell. The pdf at the point x* is estimated as:

    p̂(x*) = K / (N · V_K(x*))

    It can be proved that, choosing K as a function of N (K = K_N), the necessary and sufficient condition for the k-nn estimate to be consistent at all points where p(x) is continuous is that K_N → +∞ for N → +∞, but with order lower than 1 (e.g., K_N = N^{1/2}).

  • Slide 22: Remarks on the k-nn Method

    Typically the cell used with k-nn is a hypersphere, so the k-nn estimation is based on the following steps:

    identify the K training samples closest to the considered point x* (with respect to the Euclidean metric);
    identify the radius r of the smallest hypersphere that, centered at x*, includes the above K samples (r coincides with the distance from x* to the farthest of the K samples);
    compute the volume of the n-dimensional hypersphere of radius r and hence the value of the estimate p̂(x*).

    Disadvantages

    The pdf estimated by the k-nn method is not a true pdf, since its integral diverges because of the singularities due to the term V_K(x*) at the denominator (e.g., V_1(x_k) = 0 for k = 1, 2, ..., N).

    The k-nn estimation is computationally heavy, even though ad hoc techniques have been proposed to reduce the computational load (e.g., KD-trees).
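    A minimal sketch (ours) of the k-nn density estimate with a hyperspherical cell; the n-ball volume formula and the choice K_N ≈ √N follow the slides above, while the function name and the test values are illustrative.

```python
import numpy as np
from math import gamma, pi

def knn_density(x_star, X, K):
    """k-nn pdf estimate at x_star: p_hat = K / (N * V_K), where V_K is the
    volume of the smallest hypersphere centered at x_star that contains the
    K nearest training samples (Euclidean metric)."""
    X = np.asarray(X, dtype=float)
    N, n = X.shape
    dists = np.sort(np.linalg.norm(X - x_star, axis=1))
    r = dists[K - 1]                                  # radius to the K-th neighbor
    V = pi ** (n / 2) / gamma(n / 2 + 1) * r ** n     # volume of an n-ball of radius r
    return K / (N * V)

# Toy usage with K_N ~ sqrt(N): estimate a standard 2-D Gaussian at the origin.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
K = int(np.sqrt(len(X)))
print(knn_density(np.zeros(2), X, K))   # should be close to 1/(2*pi) ~ 0.159
```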

  • Slide 23: Parzen-Window Estimation: Introduction

    Hypotheses and notation

    Suppose R is an n-dimensional hypercube with side h (and hence volume V = h^n), centered at the point x*.

    Introduce the following rectangular function:

    φ(x) = 1 if x lies inside the hypercube with unit side centered at 0,   φ(x) = 0 elsewhere

    Introduction to the method

    The training sample x_k belongs to the hypercube R with center x* and side h if φ[(x_k − x*)/h] = 1, otherwise φ[(x_k − x*)/h] = 0.

    The number of training samples that fall into R is therefore:

    K = ∑_{k=1}^{N} φ[(x_k − x*)/h]

    Consequently, the estimate can be computed as:

    p̂(x*) = K / (N V) = (1/N) ∑_{k=1}^{N} (1/h^n) φ[(x_k − x*)/h]

  • Slide 24: Parzen-Window Estimation

    The estimate just illustrated, based on counting the number of training samples included in a prefixed volume, can be interpreted as the superposition of rectangular contributions, each associated to a single sample.

    To obtain more regular estimates (the rectangular function is discontinuous), the previous expression is generalized: the pdf estimate is expressed as the sum of N contributions, one per sample, where each contribution is given by a function φ(·), in general not rectangular, taking real values that vary with continuity. The following estimate is obtained:

    p̂(x) = (1/N) ∑_{k=1}^{N} (1/h^n) φ[(x − x_k)/h]

    The function φ(·) is called Parzen window or kernel, and the parameter h is the width of the window (or of the kernel).
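    A minimal sketch (ours) of the Parzen-window estimator, using a spherically symmetric Gaussian kernel; the width choice h = h1/N^(1/4) for n = 2 anticipates the example on slide 32, and the function name is illustrative.

```python
import numpy as np

def parzen_estimate(x_star, X, h):
    """Parzen-window pdf estimate at x_star with an n-dimensional Gaussian
    kernel phi(u) = (2*pi)^(-n/2) * exp(-||u||^2 / 2):
    p_hat(x*) = (1/N) * sum_k (1/h^n) * phi((x* - x_k)/h)."""
    X = np.asarray(X, dtype=float)
    N, n = X.shape
    u = (x_star - X) / h                              # one row per training sample
    phi = (2.0 * np.pi) ** (-n / 2.0) * np.exp(-0.5 * np.sum(u**2, axis=1))
    return np.mean(phi) / h**n

# Toy usage: 1000 samples of a standard 2-D Gaussian, width h_N = h1 / N^(1/4).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
h = 1.0 / len(X) ** 0.25
print(parzen_estimate(np.zeros(2), X, h))   # should be close to 1/(2*pi) ~ 0.159
```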


  • Slide 25: Features of the Kernel Function

    For the Parzen-window estimate to make sense, it is necessary to impose restrictions on the kernel φ(·).

    A necessary and sufficient condition for the Parzen-window estimate to be a pdf is that the kernel function itself be a pdf (i.e., a non-negative and normalized function):

    φ(x) ≥ 0 for every x ∈ R^n,   ∫_{R^n} φ(x) dx = 1

    Moreover, some further conditions are usually accepted with the aim of obtaining a good estimate:

    φ(·) takes its global maximum at 0;
    φ(·) is continuous (this guarantees that the estimate does not vary suddenly or have discontinuities);
    φ(x) is infinitesimal for ‖x‖ → ∞ (so that the effect of a sample vanishes at large distances from the sample itself):

    lim_{‖x‖→∞} φ(x) = 0


  • Slide 26: Examples of Kernel Functions for n = 1

    Rectangular kernel: φ(x) = 1 for |x| ≤ 1/2, φ(x) = 0 elsewhere (here φ(·) does not satisfy the continuity condition).

    Triangular kernel: φ(x) = 1 − |x| for |x| ≤ 1, φ(x) = 0 elsewhere.

    Gaussian kernel: φ(x) = (1/√(2π)) exp(−x²/2).

    Exponential kernel: φ(x) = (1/2) exp(−|x|).

    Cauchy kernel: φ(x) = 1 / [π(1 + x²)].

    Kernel with sinc²(·) behavior: φ(x) = (1/(2π)) [sin(x/2) / (x/2)]².
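    For illustration (ours), the kernels above can be coded directly and their normalization checked numerically; the function names are arbitrary, and each can serve as the window φ(·) of a one-dimensional Parzen estimator.

```python
import numpy as np

# One-dimensional Parzen kernels from the slide above (all integrate to 1).
def rect(x):     return np.where(np.abs(x) <= 0.5, 1.0, 0.0)
def triang(x):   return np.where(np.abs(x) <= 1.0, 1.0 - np.abs(x), 0.0)
def gauss(x):    return np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)
def expon(x):    return 0.5 * np.exp(-np.abs(x))
def cauchy(x):   return 1.0 / (np.pi * (1.0 + x**2))
def sinc2(x):    return np.sinc(x / (2.0 * np.pi))**2 / (2.0 * np.pi)

# Numerical check of the normalization condition on a wide grid.
x = np.linspace(-200.0, 200.0, 400001)
for phi in (rect, triang, gauss, expon, cauchy, sinc2):
    print(phi.__name__, np.trapz(phi(x), x))   # each value should be close to 1
```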


  • Slide 27: Remark on the Parzen-Window Estimation

    Multidimensional case

    Often, in multidimensional feature spaces (n > 1), the choice of the kernel function is reduced to the one-dimensional case by adopting:

    φ(x) ∝ γ(‖x‖)

    where γ(·) is a one-dimensional kernel function (i.e., one of those listed on the previous slide). In other words, the (multidimensional) kernel φ(·) has spherical symmetry, and its behavior, moving outward from the centre, is derived from γ(·).

    Properties of the Parzen-window estimate

    It can be proved that, in general, the Parzen-window estimate is biased.

    However, choosing the width h of the kernel as a function of the number N of training samples (i.e., h = h_N) and imposing that {h_N} be an infinitesimal sequence of order smaller than 1/n, the Parzen-window estimate becomes asymptotically unbiased and consistent (e.g., h_N = N^{−1/(2n)}).


  • Slide 28: Parzen-Window Estimation with a Finite Number of Samples

    The asymptotic properties of the Parzen-window estimate are derived by letting the number of training samples go to infinity, which is obviously not realistic.

    With a finite training set, choosing h → 0 the estimate degenerates into a sequence of Dirac pulses centered on the single samples, and thus exhibits excessive variability; if, instead, h is too large, excessive smoothing is produced.

    Therefore, the application of the method requires a large number of training samples, an adequate choice of the kernel function, and a compromise choice for the value of h.

    Automatic algorithms exist (not described in this course) for the automatic optimization [Scott et al., 1987] [Sain et al., 1974] or even the adaptive optimization [Mucciardi, Gose, 1970] of the kernel width.


  • Slide 29: Remarks on the Parzen-Window Estimation

    Computational complexity

    Like the k-nn estimate, the Parzen-window estimate is computationally heavy. However, approaches exist to reduce the complexity of the Parzen-window estimation (not presented in this course).

    Probabilistic Neural Networks

    The Parzen-window estimation with multidimensional Gaussian kernels with spherical symmetry can be implemented by means of a neural architecture called Probabilistic Neural Network (PNN) [Specht, 1990].


  • Slide 30: Example (1)

    Parzen-window estimates of a one-dimensional Gaussian pdf using a Gaussian kernel N(0, 1). Since n = 1, we have considered:

    h_N = h_1 / √N

    with constant h_1 > 0.


  • Slide 31: Example (2)

    Parzen-window estimates of a bimodal one-dimensional pdf (one uniform mode and one triangular mode) using Gaussian kernels N(0, 1). The same expression of h_N as for the 1-D Gaussian pdf has been adopted.


  • Slide 32: Example (3)

    Parzen-window estimates of a two-dimensional Gaussian pdf using Gaussian kernels N(0, I). Since n = 2, we have considered:

    h_N = h_1 / N^{1/4}

    with constant h_1 > 0.

  • Slide 33: pdf Estimation by Expansion in Basis Functions (content not preserved in the source; from the following slides, the pdf is approximated over a domain A as a linear combination of m orthonormal basis functions, p̂(x) = ∑_{i=1}^{m} c_i φ_i(x), with scalar product ⟨f, g⟩ = ∫_A f(x) g(x) dx and the associated norm ‖·‖.)

  • Slide 34: Minimization of the Quadratic Error

    In particular, we search for the estimate that presents the minimum mean quadratic error with respect to the true pdf in the space spanned by the m basis functions. Therefore, the minimization of the following functional is considered:

    ε²(c_1, c_2, ..., c_m) = ‖p − ∑_{i=1}^{m} c_i φ_i‖²
                           = ⟨p, p⟩ − 2 ∑_{i=1}^{m} c_i ⟨p, φ_i⟩ + ∑_{i=1}^{m} ∑_{j=1}^{m} c_i c_j ⟨φ_i, φ_j⟩
                           = ‖p‖² − 2 ∑_{i=1}^{m} c_i ⟨p, φ_i⟩ + ∑_{i=1}^{m} c_i²

    where the last equality uses the orthonormality of the basis functions. The functional to be minimized is a quadratic form in the coefficients c_1, c_2, ..., c_m. By imposing a simple stationarity condition (null gradient) we obtain:

    ∂ε²/∂c_i = 2 c_i − 2 ⟨p, φ_i⟩ = 0   ⟹   c_i = ⟨p, φ_i⟩,   i = 1, 2, ..., m


  • Slide 35: Computation of the Optimal Coefficients

    Estimation of the coefficients of the expansion

    Taking into account that p(x) is a pdf defined over A, we can estimate the scalar product ⟨p, φ_i⟩ (i = 1, 2, ..., m) by using the set of training samples:

    c_i = ⟨p, φ_i⟩ = ∫_A φ_i(x) p(x) dx = E{φ_i(x)}   ⟹   ĉ_i = (1/N) ∑_{k=1}^{N} φ_i(x_k)

    Approximation error

    Increasing the number m of basis functions, estimates with smaller and smaller approximation errors can be obtained: for m → +∞, we expect a vanishing error.

    In fact, the existence of complete orthonormal bases can be demonstrated, that is, sequences of orthonormal functions {φ_i(·): i = 1, 2, ...} such that any function f with finite energy can be expanded as:

    f = ∑_{i=1}^{∞} ⟨f, φ_i⟩ φ_i

    where the series converges in quadratic mean (with respect to the introduced norm).
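    A minimal sketch (ours) of the coefficient estimation ĉ_i = (1/N) ∑_k φ_i(x_k); it uses the trigonometric basis introduced on the next slide and samples drawn from a uniform pdf on [0, 2π], whose expansion reduces to the constant term.

```python
import numpy as np

def estimate_coefficients(samples, basis):
    """c_i = E{phi_i(x)} estimated by the sample mean (1/N) * sum_k phi_i(x_k)."""
    return np.array([np.mean(phi(samples)) for phi in basis])

def expanded_pdf(x, coeffs, basis):
    """Truncated expansion p_hat(x) = sum_i c_i phi_i(x)."""
    return sum(c * phi(x) for c, phi in zip(coeffs, basis))

# Orthonormal trigonometric basis on [0, 2*pi] (first m = 5 functions).
basis = [lambda x: 0 * x + (2 * np.pi) ** -0.5,
         lambda x: np.cos(x) / np.sqrt(np.pi),
         lambda x: np.sin(x) / np.sqrt(np.pi),
         lambda x: np.cos(2 * x) / np.sqrt(np.pi),
         lambda x: np.sin(2 * x) / np.sqrt(np.pi)]

X = np.random.default_rng(0).uniform(0.0, 2.0 * np.pi, size=10000)
c = estimate_coefficients(X, basis)
print(c)                          # ~ [0.399, 0, 0, 0, 0]
print(expanded_pdf(1.0, c, basis))   # ~ 1/(2*pi) ~ 0.159
```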


  • Slide 36: Choice of the Basis Functions (1)

    In general, since the true pdf is unknown, it is not possible to identify a priori the orthonormal basis that provides a given approximation error with the minimum number of coefficients. The choice can be made on the basis of operational issues, such as implementation simplicity or computation time.

    Examples of complete orthonormal bases in the case n = 1

    The trigonometric functions form a complete orthonormal basis over [0, 2π] (Fourier series expansion):

    φ_i(x) = (2π)^{−1/2}           for i = 1
    φ_i(x) = π^{−1/2} cos(rx)      for i = 2r       (r = 1, 2, ...)
    φ_i(x) = π^{−1/2} sin(rx)      for i = 2r + 1   (r = 1, 2, ...)

    Complete orthonormal bases can be generated (over various domains) by means of systems of orthogonal polynomials (Legendre, Hermite, Laguerre, Chebyshev polynomials).


  • Slide 37: Choice of the Basis Functions (2)

    Legendre polynomials

    They are a sequence of recursively defined polynomials:

    P_0(x) = 1,   P_1(x) = x,   P_2(x) = (3x² − 1)/2,   P_3(x) = (5x³ − 3x)/2, ...
    P_{i+1}(x) = [ (2i + 1) x P_i(x) − i P_{i−1}(x) ] / (i + 1)

    They are orthogonal on [−1, 1] and need to be normalized:

    ∫_{−1}^{1} P_i(x) P_j(x) dx = 2 δ_ij / (2i + 1)   ⟹   φ_i(x) = √((2i + 1)/2) · P_i(x)

    In the case n > 1, a complete orthonormal basis can be obtained by multiplying one-dimensional basis functions: given a one-dimensional basis {φ_i}, a two-dimensional basis can be defined as follows:

    φ_1(x_1, x_2) = φ_1(x_1) φ_1(x_2),   φ_2(x_1, x_2) = φ_2(x_1) φ_1(x_2),
    φ_3(x_1, x_2) = φ_1(x_1) φ_2(x_2),   φ_4(x_1, x_2) = φ_2(x_1) φ_2(x_2),
    φ_5(x_1, x_2) = φ_3(x_1) φ_1(x_2),   ...
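    A short sketch (ours) of the normalized Legendre basis, using scipy's eval_legendre; the orthonormality is checked numerically and one two-dimensional product function is formed.

```python
import numpy as np
from scipy.special import eval_legendre   # Legendre polynomial P_i(x)

def phi(i, x):
    """Normalized Legendre basis function on [-1, 1]: sqrt((2i+1)/2) * P_i(x)."""
    return np.sqrt((2 * i + 1) / 2.0) * eval_legendre(i, x)

# Numerical check of orthonormality: <phi_i, phi_j> should be delta_ij.
x = np.linspace(-1.0, 1.0, 20001)
for i in range(3):
    for j in range(3):
        print(i, j, round(np.trapz(phi(i, x) * phi(j, x), x), 4))

# A two-dimensional basis function is just a product, e.g. phi_4(x1, x2) below.
phi4 = lambda x1, x2: phi(1, x1) * phi(1, x2)     # = (3/2) * x1 * x2
print(phi4(0.5, 0.4))                             # 0.3
```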


  • Slide 38: Accuracy of the Functional Approximation

    The quality of the approximation depends on different elements:

    orthogonality of the basis functions over the region of the feature space in which the samples take values;
    number m of adopted basis functions.

    Number of basis functions

    The number m necessary to reach a certain approximation error depends on the chosen type of basis functions (e.g., a sinusoidal p(x) will in general require fewer functions from a trigonometric basis than from a polynomial one).

    In the absence of a priori information on p(x), for a given basis, m is typically determined by inserting the estimated pdf into the adopted classifier, evaluating the performances on the test set, and increasing m until the desired accuracy is reached.


  • Slide 39: Example (1)

    ML classification with estimates based on Legendre polynomials.

    Two classes are described by the following training samples in a two-dimensional feature space:

    ω1: (−3/5, 0), (0, −3/5), (−3/5, −3/5)
    ω2: (1, 1), (4/5, 2/5), (3/5, 4/5)

    We adopt a Legendre polynomial basis of order 4 (m = 4). Note that the features are normalized into [−1, 1], which is the orthogonality interval of the Legendre polynomials (if this is not the case, it is sufficient to normalize them to such an interval). The basis functions are:

    φ_1(x_1, x_2) = (1/√2)(1/√2) P_0(x_1) P_0(x_2) = 1/2
    φ_2(x_1, x_2) = √(3/2)(1/√2) P_1(x_1) P_0(x_2) = (√3/2) x_1
    φ_3(x_1, x_2) = (1/√2)√(3/2) P_0(x_1) P_1(x_2) = (√3/2) x_2
    φ_4(x_1, x_2) = √(3/2)√(3/2) P_1(x_1) P_1(x_2) = (3/2) x_1 x_2

    [Figure: the two classes plotted in the (x_1, x_2) plane over [−1, 1] × [−1, 1].]


  • Slide 40: Example (2)

    Computation of the coefficients.

    For class ω1:

    c_1 = (1/3) [φ_1(−3/5, 0) + φ_1(0, −3/5) + φ_1(−3/5, −3/5)] = 1/2
    c_2 = (1/3) [φ_2(−3/5, 0) + φ_2(0, −3/5) + φ_2(−3/5, −3/5)] = (√3/2)(1/3)(−6/5) = −√3/5
    c_3 = (1/3) [φ_3(−3/5, 0) + φ_3(0, −3/5) + φ_3(−3/5, −3/5)] = (√3/2)(1/3)(−6/5) = −√3/5
    c_4 = (1/3) [φ_4(−3/5, 0) + φ_4(0, −3/5) + φ_4(−3/5, −3/5)] = (3/2)(1/3)(9/25) = 9/50

    p̂(x|ω1) = 1/4 − (3/10) x_1 − (3/10) x_2 + (27/100) x_1 x_2,   x_1, x_2 ∈ [−1, 1]

    For class ω2:

    c_1 = (1/3) [φ_1(1, 1) + φ_1(4/5, 2/5) + φ_1(3/5, 4/5)] = 1/2
    c_2 = (1/3) [φ_2(1, 1) + φ_2(4/5, 2/5) + φ_2(3/5, 4/5)] = (√3/2)(1/3)(12/5) = 2√3/5
    c_3 = (1/3) [φ_3(1, 1) + φ_3(4/5, 2/5) + φ_3(3/5, 4/5)] = (√3/2)(1/3)(11/5) = 11√3/30
    c_4 = (1/3) [φ_4(1, 1) + φ_4(4/5, 2/5) + φ_4(3/5, 4/5)] = (3/2)(1/3)(1 + 8/25 + 12/25) = 9/10

    p̂(x|ω2) = 1/4 + (3/5) x_1 + (11/20) x_2 + (27/20) x_1 x_2,   x_1, x_2 ∈ [−1, 1]


  • Slide 41: Example (3)

    Computation of the ML discriminant curve

    The discriminant curve is obtained by equating the two estimated pdfs:

    p̂(x|ω1) = p̂(x|ω2):
    1/4 − (3/10) x_1 − (3/10) x_2 + (27/100) x_1 x_2 = 1/4 + (3/5) x_1 + (11/20) x_2 + (27/20) x_1 x_2
    ⟹ (9/10) x_1 + (17/20) x_2 + (27/25) x_1 x_2 = 0
    ⟹ 90 x_1 + 85 x_2 + 108 x_1 x_2 = 0
    ⟹ x_2 = −90 x_1 / (108 x_1 + 85)

    The discriminant curve is an equilateral hyperbola (i.e., it has orthogonal asymptotes).

    If we used three basis functions, we would obtain a linear discriminant function.

    [Figure: the discriminant curve plotted in the (x_1, x_2) plane over [−1, 1] × [−1, 1], separating the two classes.]
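    The whole example can be verified numerically; the sketch below (ours) recomputes the coefficients of both classes from the training samples and checks that the two estimated pdfs coincide on the discriminant curve.

```python
import numpy as np

# Order-4 two-dimensional Legendre basis from the example.
basis = [lambda x: 0.5,
         lambda x: np.sqrt(3.0) / 2.0 * x[0],
         lambda x: np.sqrt(3.0) / 2.0 * x[1],
         lambda x: 1.5 * x[0] * x[1]]

X1 = np.array([[-0.6, 0.0], [0.0, -0.6], [-0.6, -0.6]])   # omega_1 samples
X2 = np.array([[1.0, 1.0], [0.8, 0.4], [0.6, 0.8]])       # omega_2 samples

def coeffs(X):
    """c_i estimated as the sample average of phi_i over the class samples."""
    return np.array([np.mean([phi(x) for x in X]) for phi in basis])

def p_hat(x, c):
    """Truncated expansion p_hat(x) = sum_i c_i phi_i(x)."""
    return sum(ci * phi(x) for ci, phi in zip(c, basis))

c1, c2 = coeffs(X1), coeffs(X2)
print(c1)   # [0.5, -sqrt(3)/5, -sqrt(3)/5, 9/50]
print(c2)   # [0.5, 2*sqrt(3)/5, 11*sqrt(3)/30, 9/10]

# A point on the curve 90*x1 + 85*x2 + 108*x1*x2 = 0 should give equal pdfs.
x1 = 0.5
x2 = -90.0 * x1 / (108.0 * x1 + 85.0)
x = np.array([x1, x2])
print(p_hat(x, c1), p_hat(x, c2))   # the two values coincide
```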
