
Advanced Review

Transductive confidence machines

James Rogers*

The field of statistical learning theory has developed alternatives to induction. Instead of using all the available points to induce a model, the data, or usually a small subset of the data, can be used to estimate unknown properties of points to be tested (e.g., membership to a class). This idea leads to algorithms that use standard statistical tests to compute the confidence on the estimation. Using transduction, researchers have built transductive confidence machines which are able to estimate the unknown class of a point and attach confidence to the estimate, and also to determine outliers in a data set. © 2011 John Wiley & Sons, Inc. WIREs Comp Stat 2011, 3:216–220. DOI: 10.1002/wics.154

INTRODUCTION

Transduction is reasoning from training data to estimate properties of test data. Differing from many machine learning methods, transduction provides measures of reliability to individual data instances and assumes only that the data instances are independent and generated by the same stochastic mechanism. Using transduction, transductive confidence machines (TCM) were built first to estimate the unknown class of a data instance and later used to determine outliers in a data set. TCM attach a statistical confidence to their classifications and outlier determinations. The TCM methodology has its theoretical foundations in the theory of algorithmic randomness. The confidence measure used in TCM is based upon universal tests for randomness. Universal tests are not computable and must be approximated using nonuniversal tests called p-values. The p-value is defined as the probability of observing a data instance in the sample space that can be considered more extreme than a sample of data. TCM use a p-value function called the strangeness measure or nonconformity score. The strangeness measure corresponds to the uncertainty of the point being measured with respect to the members of a class or cluster. Many different strangeness measures are possible, but the one most often utilized for TCM is nearest-neighbor distance.

This article is a U.S. Government work, and as such, is in the public domain in the United States of America.
*Correspondence to: [email protected]

U.S. Army Engineer Research and Development Center, Alexandria, VA, USA

DOI: 10.1002/wics.154

INDUCTION

Induction is inferring a general rule from observed training data and applying the rule to the test data. For induction, a model is selected and then the test data are compared to the model to determine the fit of the data to the model. The inductive approach for classifying unlabeled test points is to use the labeled points to train a supervised classification algorithm, and then have the algorithm predict the classes for the unlabeled test points.

TRANSDUCTION

Transduction is reasoning from training data to estimate properties of the test data. The transductive reliability estimation process has its theoretical foundations in the algorithmic theory of randomness developed by Kolmogorov.1 Transduction2 means that, given the classifications of the points in the training data set, the objective is to try to determine the classifications of the points in the test data set. Unlike traditional methods in machine learning, transduction can offer measures of reliability to individual examples, and uses very broad assumptions. It only assumes that the data points are independent and generated by the same stochastic mechanism. Transduction predictions may not be achievable from an inductive model. An advantage of transduction over induction is that it is possible with transduction to make better predictions with fewer labeled training points. An interpretation of transduction3 is that unknown estimates for individual points of interest can be made directly from the training data, as opposed to using induction to infer a general rule for them.

Algorithmic Randomness Theory

Many machine learning algorithms output predictions but do not provide a confidence in those predictions. Algorithmic randomness theory4 provides universal measures of confidence, but they are noncomputable. The problem of assigning confidence to predictions is connected to the problem of defining random sequences. Kolmogorov1 believed randomness to be a foundation for applications of probability.

Sequences that do not seem to have any specific pattern in their elements would be more typical than sequences in which one can find regularities.5 From algorithmic randomness theory, there exists a universal method of finding regularities in data sequences. Martin-Löf, who was the first to introduce the notion of a randomness test,1 stated that a function t : Z* → [0, 1] is a test for randomness with respect to a class of probability distributions Q in Z if, for all n ∈ N, for all s ∈ [0, 1], and for all probability distributions P in Q,

P^n{z ∈ Z^n : t(z) ≤ s} ≤ s,    (1)

where t is semicomputable.

Strangeness Measure

Users of transduction as a test of confidence have approximated a universal test for randomness, which in its general form is noncomputable, by using a p-value function called the strangeness measure6 or nonconformity score.4 There is more than a single definition of the strangeness measure. The general idea is that the strangeness measure corresponds to the uncertainty of the point being measured with respect to all the other labeled examples of a class. A higher strangeness measure means a higher uncertainty.

In Ref 5 the strangeness is defined as the ratio between the sum of the distances from the point to its k nearest neighbors inside the class under consideration and the sum of the distances to its k nearest neighbors outside the class. The strangeness α_iy of a point i with respect to a class y is defined as follows:

α_iy = (Σ_{j=1}^{k} D^y_ij) / (Σ_{j=1}^{k} D^{−y}_ij),    (2)

where D^y_ij is the sorted sequence of distances between point i and points in the class y, and D^{−y}_ij represents the sorted sequence of distances between i and points in classes different from y.

This definition of strangeness is based on the nearest neighbors to the test point in and out of the particular class for which the point is being tested: the strangeness is the ratio of the sum of the k smallest distances to the class proposed for i, to the sum of the k smallest distances to the other classes. The larger the value of α, the stranger the test point.
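As a concrete illustration, here is a minimal sketch of this strangeness measure in Python, assuming Euclidean distances and NumPy arrays of labeled points; the function name and arguments are illustrative, not from the paper:

```python
import numpy as np

def knn_strangeness(point, X, labels, candidate_class, k=3):
    """Eq. (2): ratio of the sum of the k smallest distances to points of the
    candidate class to the sum of the k smallest distances to all other classes."""
    same = X[labels == candidate_class]
    other = X[labels != candidate_class]
    d_same = np.sort(np.linalg.norm(same - point, axis=1))[:k]
    d_other = np.sort(np.linalg.norm(other - point, axis=1))[:k]
    return d_same.sum() / d_other.sum()
```

A large ratio means the point sits much farther from the candidate class than from the other classes, i.e., it is strange with respect to that class.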

p-Value

Transduction has been used to provide confidence measures for the decision of labeling a point as belonging to one of the predefined classes (see Refs 5–7). The confidence measure is based upon universal tests for randomness or their approximation. A Martin-Löf randomness deficiency test1 based on such tests is a universal version of the standard p-value. Martin-Löf proved that there exists a universal test for randomness smaller than any other test up to a multiplicative constant. Unfortunately, universal tests are not computable and have to be approximated using nonuniversal tests called p-values. The p-value is defined as the probability of observing a point in the sample space that can be considered more extreme than a sample of data. The p-value serves as a measure of how well the data support or do not support a null hypothesis: it is the maximum probability, under the null hypothesis, that the test statistic assumes the observed value or a value at least as extreme in the direction indicated by the alternative hypothesis. The smaller the p-value, the smaller the chance that the test statistic would have assumed a value this incompatible with the null hypothesis (here, 'class y is a good fit for point i') if that hypothesis were true; hence the smaller the p-value, the greater the evidence against the null hypothesis. A p-value function can be defined, in terms of α, as follows:

t(z_1, z_2, ..., z_n) = #{i = 1, ..., n : α_iy ≥ α_ny} / n.    (3)

If z_n is the point in question, the function t(·) will measure the probability of having points already in the class with strangeness greater than or equal to that of z_n.
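A minimal sketch of this p-value computation, assuming the strangeness values of the examples already carrying label y are available as a NumPy array (names are illustrative):

```python
import numpy as np

def transductive_p_value(alpha_new, alphas_in_class):
    """Eq. (3): fraction of examples, including the new one, whose strangeness
    is greater than or equal to the strangeness of the new example."""
    alphas = np.append(alphas_in_class, alpha_new)
    return np.mean(alphas >= alpha_new)
```

Because the new example is included in the count, the smallest possible p-value is 1/n, never exactly zero.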

TRANSDUCTIVE CONFIDENCE MACHINES

The TCM6 introduced the computation of the confidence using algorithmic randomness theory.1


The confidence measure used in TCM is based upon the calculation of p-values. TCM is a prediction method providing p-values p_y for any possible value y of the unknown label y_n.8 The prediction algorithm can be given a threshold δ as an input, and its answer will be that the label y_n should be in the set of those y such that p_y > δ. The validity property states that the set prediction will be wrong with probability at most δ, so some maximal probability of error can be guaranteed.8 At any confidence level 1 − δ, the frequency of errors is guaranteed not to exceed δ, provided the examples are generated independently from the same probability distribution.9 The value δ is called the significance level and the value 1 − δ is called the confidence level.
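As a small illustration of this set prediction, the prediction at significance level δ is simply the set of labels whose p-value exceeds δ; a sketch with illustrative names:

```python
def prediction_set(p_values, delta):
    """Return the labels whose p-value exceeds the significance level delta.
    By the validity property, the true label is excluded from this set with
    probability at most delta."""
    return {label for label, p in p_values.items() if p > delta}

# Example: prediction_set({'a': 0.42, 'b': 0.03, 'c': 0.11}, delta=0.05) -> {'a', 'c'}
```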

Transductive Confidence Machines-k Nearest Neighbors

The algorithm of transductive confidence machines-k nearest neighbors (TCM-kNN),5 outlined in Figure 1, attempts to place a new point in each class of the data. While doing so, it may force an update of some of the strangeness values for the training data. This happens whenever the distance between a training point and the new one is less than the largest of the k distances used to compute that training point's strangeness. After the strangeness values are calculated, the algorithm computes one p-value for each of the attempts (i.e., for each class placement) and predicts that the point belongs to the class with the largest p-value, with a confidence equal to one minus the second-largest p-value.

FIGURE 1 | The TCM-kNN algorithm: compute the strangeness value and p-value of every training example; for each candidate class, recalculate the strangeness of any training example (inside or outside the class) whose largest relevant k-neighbor distance exceeds its distance to the new example; calculate the strangeness of the new example and compute its p-value for the candidate class; predict the class with the largest p-value and output as confidence one minus the second-largest p-value.

If there are m training points with a sequence of strangeness values (α_1, ..., α_m), and a new strangeness value α_new is calculated for a point being tested, then α_new can be placed anywhere in the sorted sequence with equal probability, because all permutations of the new sequence are equiprobable. Thus, the probability that α_new is among the j largest values is at most j/(m + 1); for example, with m = 9 training points, the probability that α_new is the single largest of the 10 values is at most 1/10. For each possible classification of a test point, the strangeness value of the test point is added to the sequence of strangeness values of the training data. The prediction for each test point is the classification that provides the most typical completion of the sequence of strangeness measures of the training data set.5

Each prediction is accompanied by a confidence measure. By Eq. (1), the second-largest p-value is an upper bound on the probability that any classification other than the predicted one is correct. The confidence measure therefore indicates how likely it is that the predicted classification is the correct one.
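Putting the pieces together, the following sketch implements a simplified TCM-kNN classifier in the spirit of Figure 1. For brevity it recomputes every strangeness value for each candidate label instead of updating only the affected neighbors, and it computes each p-value over the examples carrying the candidate label, as described above; all names are illustrative.

```python
import numpy as np

def strangeness(i, X, labels, label, k=3):
    """Eq. (2) strangeness of row i of X under the given label,
    ignoring the point's zero distance to itself."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    same = np.sort(d[labels == label])[:k]
    other = np.sort(d[labels != label])[:k]
    return same.sum() / other.sum()

def tcm_knn_predict(X_train, y_train, x_new, k=3):
    """Try each possible label for x_new, compute one p-value per label,
    predict the label with the largest p-value, and report a confidence
    of one minus the second-largest p-value."""
    p_values = {}
    for label in np.unique(y_train):
        X = np.vstack([X_train, x_new])
        y = np.append(y_train, label)
        alphas = np.array([strangeness(i, X, y, y[i], k) for i in range(len(y))])
        in_class = y == label
        p_values[label] = np.mean(alphas[in_class] >= alphas[-1])
    ranked = sorted(p_values.values(), reverse=True)
    prediction = max(p_values, key=p_values.get)
    confidence = 1.0 - ranked[1] if len(ranked) > 1 else 1.0
    return prediction, confidence, p_values
```

With a training matrix X_train, labels y_train, and a new point x_new, the function returns the predicted class, its confidence, and the full dictionary of per-class p-values.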

APPLICATIONS

Classification

Initially, TCM was used to classify new test points and assign a confidence to the classification.5 A training data set (x_1, y_1), ..., (x_m, y_m) has m elements, where x_i = (x_i^1, ..., x_i^n) is the set of feature values of example i, and y_i is the classification of example i, drawn from a finite set of possible classifications. There is a set of test examples similar to the training set, but their actual classifications are not known. The goal is to determine, for each and every test example, one of the possible classifications, and for every classification a confidence in that prediction is provided.

For each example i, the distances to the other examples with the same classification y are sorted in ascending order into the sequence D^y_ij, and D^{−y}_ij is the sorted sequence of distances to examples with a classification different from y. An individual strangeness measure is assigned to every example: the strangeness measure for an example i with label y is defined as in Eq. (2), where k is the number of neighbors. The strangeness of an example increases when the distances to the examples of the same class become larger or when the distances to the other classes become smaller.

Outliers

The outlier determination method described in Ref 10 makes use of the concept of transduction to find outliers. In determining outliers, the interest is in learning the likelihood that a point belongs to a given cluster in the current clustering model. The training data are the typical points, and the test points are compared to these typical points. These properties make transduction an ideal mechanism to detect outliers, even though it had never been used before for that purpose. In Ref 10, the ideas of TCM are used to design a test that determines whether a point is an outlier and attaches a confidence to the estimation.

If the point i being tested is an outlier then, according to the definition provided in Ref 5, the strangeness of i becomes the ratio between two large numbers, since the distances from the point in question, i, to any of the points within clusters are large. In some cases this ratio will be small enough to be comparable to the strangeness values of points already in the cluster, which leads to false negatives; experiments show this to be the case. For determining outliers, the definition of the strangeness function is therefore modified to estimate how likely points are to be outliers with respect to clusters. Points in a given cluster of the model are treated as belonging to the same label or class, even though such a class has not been explicitly defined. This is possible because points in the same cluster exhibit enough similarity among themselves, as measured by the underlying clustering algorithm, to be commonly labeled as belonging to one class. For determining outliers, the strangeness α_iy of a point i with respect to a cluster y is defined as follows:

α_iy = Σ_{j=1}^{k} D^y_ij.    (4)

This definition makes the strangeness value of a point far away from the cluster considerably larger than the strangeness of points already inside the cluster, and it has been employed by Angiulli11 as a measure of isolation. This strangeness measure uses the Euclidean distance to compute the distance between pairs of points. The definition is very different from the one utilized by TCM-kNN, because it does not divide the sum of the distances to the nearest neighbors within the cluster by the sum of the distances to the nearest neighbors outside the cluster.

Using Eq. (4), a series of p-values can be computed for i, one for each cluster. Each p-value is computed using Eq. (3) as the fraction of points in the cluster, including i, that have strangeness greater than or equal to that of the point i being diagnosed.

Each cluster provides a p-value, so a series of p-values, one for each cluster, is obtained. The largest p-value in this series is called p_max. This provides a way of testing the fitness of point i, by testing the null hypothesis H_0^y: 'i is fit to be in cluster y.' The alternative hypothesis H_1^y is then 'i is ill-fit to be in cluster y.' Selecting a confidence level 1 − τ, which is usually 95%, tests are conducted to determine whether p_max ≤ τ. If this is true, all the null hypotheses are rejected and the point is declared an outlier; otherwise, all the alternative hypotheses are rejected. Choosing τ does not choose the percentage of outliers; rather, τ selects the confidence level for the hypothesis testing. A τ of 5% does not necessarily result in 5% of the points being identified as outliers, as experiments corroborate, but τ controls when the null hypothesis will be rejected and indicates the maximum error that will be incurred by that decision.

The pseudo-code of the strangeness-based outlier detection (StrOUD) algorithm,10 which computes the fitness of a point i to be included in existing clusters, is shown in Figure 2. It contains a rule to accept or reject the alternative hypotheses for the fitness of i to be included in the existing clusters. The algorithm is not used to decide the placement of the point in the clusters; that decision is left to the clustering algorithm of choice. If no clustering information is available, the algorithm tests a single null hypothesis H_0: 'i is fit to be considered part of the general population of points.'
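A minimal sketch of the StrOUD test, assuming the clusters are given as NumPy arrays of points and the strangeness of Eq. (4) is the sum of Euclidean distances to the k nearest neighbors within the cluster (names are illustrative):

```python
import numpy as np

def cluster_strangeness(point, cluster, k=3):
    """Eq. (4): sum of the distances to the k nearest neighbors in the cluster.
    For a point that is itself a cluster member, the zero self-distance is
    included, which is harmless for this sketch."""
    d = np.sort(np.linalg.norm(cluster - point, axis=1))
    return d[:k].sum()

def stroud_is_outlier(point, clusters, tau=0.05, k=3):
    """Compute one p-value per cluster (Eq. 3) and declare the point an outlier,
    with confidence 1 - tau, if the largest p-value does not exceed tau."""
    p_values = []
    for cluster in clusters:
        baseline = np.array([cluster_strangeness(q, cluster, k) for q in cluster])
        alpha_new = cluster_strangeness(point, cluster, k)
        p_values.append(np.mean(np.append(baseline, alpha_new) >= alpha_new))
    p_max = max(p_values)
    return p_max <= tau, p_max
```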

Spatial–Temporal Outliers

To incorporate the effects of space and time, three kernels are embedded into the outlier determination method StrOUD: one kernel for the feature measurements, a second for the spatial coordinates, and a third for time. The total distance is the weighted sum d = α_1(d_f) + α_2(d_s) + α_3(d_t), with α_1 + α_2 + α_3 = 1, where d_f is the feature distance, d_s the spatial distance, and d_t the temporal distance, all of which are normalized. This total distance is then utilized as the distance measurement in StrOUD. The distances are only computed with respect to points that occur before the time of the point being tested. The algorithm for determining outliers in spatial–temporal data is shown in Figure 3.
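A sketch of the combined distance for spatial–temporal outliers, assuming the three component distances have already been normalized to comparable scales (the weights shown are illustrative defaults, not values from the paper):

```python
def combined_distance(d_feature, d_spatial, d_temporal,
                      w_feature=0.5, w_spatial=0.3, w_temporal=0.2):
    """Total distance d = a1*df + a2*ds + a3*dt with a1 + a2 + a3 = 1,
    used as the distance measure inside StrOUD; it should only be computed
    against points that occur before the time of the point being tested."""
    assert abs(w_feature + w_spatial + w_temporal - 1.0) < 1e-9
    return (w_feature * d_feature
            + w_spatial * d_spatial
            + w_temporal * d_temporal)
```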


FIGURE 2 | StrOUD algorithm.

Given a point i under consideration:
  Compute the p-values of i with respect to clusters 1, ..., c.
  Sort the p-value list in descending order and call the highest p-value p_max.
  If p_max ≤ τ then
    Reject all the null hypotheses H_0^y, for y = 1, ..., c.
    Declare i an outlier with confidence 1 − τ.
  Else
    Reject all the alternative hypotheses (i belongs to a cluster in the model).

FIGURE 3 | Determining outliers in spatial–temporal data algorithm.

Let there be m training examples.
For each training example:
  Calculate and normalize d_f, d_s, and d_t, and set d = α_1(d_f) + α_2(d_s) + α_3(d_t) (where α_1 + α_2 + α_3 = 1).
For every test example:
  Calculate and normalize d_f, d_s, and d_t, and set d = α_1(d_f) + α_2(d_s) + α_3(d_t), using only points that occur before the time of the test example.
  Calculate the strangeness values from the combined distances over features, spatial coordinates, and time.
  Compute the p-value of the point.
  Output as confidence one minus the p-value.

CONCLUSION

Using transduction, TCM is a method that is capable of classifying points and determining outliers by generating a prediction with a confidence measure for that prediction. TCM, which has its foundation in algorithmic randomness theory, utilizes a strangeness measure and the calculation of p-values to make predictions. TCM only assumes that the data points are independent and generated by the same stochastic mechanism, and it has been used successfully to estimate membership in a class and to detect outliers in a data set, including spatial–temporal data sets. Empirical evaluations conducted using TCM consistently produced very good results for both classification and outlier detection.

REFERENCES

1. Li M, Vitanyi P. An Introduction to Kolmogorov Complexity and Its Applications. 2nd ed. New York: Springer; 1997.

2. Gammerman A, Vovk V, Vapnik V. Learning by transduction. In: Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence; 1998, 148–155.

3. Vapnik V. Statistical Learning Theory. New York: Wiley; 1998.

4. Vovk V, Gammerman A, Saunders C. Machine learning applications of algorithmic randomness. In: Proceedings of the 16th International Conference on Machine Learning; 1999, 444–453.

5. Proedrou K, Nouretdinov I, Vovk V, Gammerman A. Transductive confidence machines for pattern recognition. In: Proceedings of the 13th European Conference on Machine Learning, Vol 2430; 2002, 381–390.

6. Gammerman A, Vovk V. Prediction algorithms and confidence measures based on algorithmic randomness theory. Theor Comput Sci 2002, 287:209–217.

7. Ho S, Wechsler H. Transductive confidence machine for active learning. In: International Joint Conference on Neural Networks, Portland, OR; 2003.

8. Nouretdinov I, Vovk V. Criterion of calibration for transductive confidence machine with limited feedback. In: Lecture Notes in Artificial Intelligence, Vol 2842. New York: Springer; 2003, 259–267.

9. Vovk V. Asymptotic optimality of transductive confidence machine. In: Proceedings of the 13th International Conference on Algorithmic Learning Theory; 2002, 336–350.

10. Barbará D, Domeniconi C, Rogers J. Detecting outliers using transduction and statistical testing. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2006, 55–64.

11. Angiulli F, Pizzuti C. Outlier mining in large high-dimensional data sets. IEEE Trans Knowl Data Eng 2005, 17:203–215.
