
Computational Optimization and Applications, 27, 53–82, 2004. © 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Approximation of Boolean Functions by Local Search

ANDREAS ALBRECHT [email protected]
Department of Computer Science, University of Hertfordshire, Hatfield, Herts AL10 9AB, UK

CHAK-KUEN WONG∗
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong

∗On leave from IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY, USA.

Abstract. Usually, local search methods are considered to be slow. In this paper, we present a simulated annealing-based local search algorithm for the approximation of Boolean functions which has a proven time complexity bound and behaves relatively fast on randomly generated functions. The functions are represented by disjunctive normal forms (DNFs). Given a set of m uniformly distributed positive and negative examples of length n generated by a target function F(x_1, ..., x_n) whose DNF consists of conjunctions with at most ℓ literals only, the algorithm computes with high probability a hypothesis H of length m · ℓ such that the error is zero on all examples. Our algorithm can be implemented easily, and we obtained a relatively high percentage of correct classifications on test examples that were not presented in the learning phase. For example, for randomly generated F with n = 64 variables and a training set of m = 16384 examples, the error on the same number of test examples was about 19% on positive and 29% on negative examples, respectively. The proven complexity bound provides the basis for further studies on the average case complexity.

Keywords: local search, simulated annealing, Boolean functions, machine learning

1. Introduction

Local search methods are based on the idea that elements of a configuration space F may be improved by making small changes in configurations [2]. Thus, the execution of a local search algorithm defines a walk in F such that each solution visited is a neighbor of the previously visited one. Local search algorithms have the reputation of being slow, but of producing fairly good solutions in a still manageable time.

The main variants of local search methods are threshold algorithms, taboo search algorithms, variable-depth search algorithms, and the more complex approach of genetic algorithms. For a detailed discussion of these methods we refer to [2]. Among threshold algorithms we will focus on simulated annealing [1]. This local search method was introduced in [11, 17] as a new approach to calculate approximate solutions of combinatorial optimization problems.

We apply simulated annealing to the following approximation problem of Boolean functions: We are given a set S ⊆ {0, 1}^n of uniformly distributed binary n-tuples that represent positive and negative examples of a target Boolean function F : {0, 1}^n → {0, 1}. We have F(σ_1, ..., σ_n) = 1 for positive examples σ ∈ S, and F(η_1, ..., η_n) = 0 for negative examples η ∈ S. About F(x_1, ..., x_n) it is known that there exists a representation by a disjunctive normal form

DNF(F) = ⋁ x_{i_1}^{σ_{i_1}} & x_{i_2}^{σ_{i_2}} & ··· & x_{i_l}^{σ_{i_l}}

such that each conjunction x_{i_1}^{σ_{i_1}} & x_{i_2}^{σ_{i_2}} & ··· & x_{i_l}^{σ_{i_l}} consists of at most l ≤ ℓ literals x^σ, where σ ∈ {0, 1} (here, we use x^1 ≡ x and x^0 ≡ x̄, i.e., x^0 = 1 for x = 0, and x^0 = 0 for x = 1).

The task is to find a short representation H for F such that the error of H is small on S. The function H is also represented by a disjunctive normal form (DNF). Thus, the only information about F is given by the set S, the number of variables n, which is equal to the length of the examples from S, and the parameter ℓ that defines the maximum length of conjunctions in a DNF representation of F.

For this particular problem we want to show that the local search method simulated annealing can be competitive compared to the best known algorithms. For |S| = m and an equal number of positive and negative examples, our algorithm computes with high probability a hypothesis H of length ℓ · m/2 with zero error on all presented examples S. Besides a proven time complexity bound, we provide results from computational experiments on randomly generated DNF where H is tested on additional examples that were not presented during the learning phase. Thus, when ℓ from F is significantly smaller than n, we indeed obtain a short representation compared to the trivial solution of length n · m/2, where simply for all positive examples the corresponding conjunctions are taken together in a disjunctive normal form formula (in another setting, the similar problem of feature minimization is discussed in [7]).

Unlike the different methods discussed in [20], we are aiming a priori at the combination of short representations and high classification rates. Thus, the most time-consuming part of our approach is the search for short conjunctions rejecting all negative examples and accepting as many positive examples as possible.

The approximation problem is closely related to the learnability of DNF in the context of VALIANT'S PAC learning model [24]. For the general case of DNF, the learnability from positive and negative examples is still an open problem. Therefore, various subclasses have been studied within different modifications of the PAC learning model; see [5, 16, 23]. For an overview of DNF learnability and a discussion of different learning models, including a comprehensive list of references, we refer to [3].

In the PAC learning model, the probability distributions D(σ) of positive and negative examples σ ∈ {0, 1}^n are not specified, and the hypothesis has to be found in polynomial time, i.e., the learning algorithm has to provide with probability at least 1 − δ a hypothesis with an error of at most ε on the entire set {0, 1}^n, regardless of D(σ). In some papers, this condition has been weakened, i.e., subexponential algorithms have been designed for special classes of DNF under the uniform probability distribution D(σ) = 2^{−n}. In [25], DNF consisting of n^{O(1)} terms are ε-approximated with confidence 1 − δ from positive and negative examples in time n^{O(log(n/(ε·δ)))} under the uniform probability distribution. In the membership query model, the time bound n^{O(log log(n/(ε·δ)))} was obtained by Mansour [19]. Within the same model, J. Jackson has shown that there exists even a polynomial-time learning algorithm [15].


Our algorithm is very similar to Verbeurgt's approach [25], but the complete search for conjunctive terms of length ℓ maximizing the number of satisfied positive examples is replaced by a simulated annealing search for a conjunction that satisfies a fixed positive example and rejects all negative examples. The search is performed for each positive example separately, and after all positive examples have been processed, the conjunctive terms with the largest number of satisfied positive examples that cover the entire set of positive examples are taken together.

We provide a complexity analysis that employs Hajek's theorem on logarithmic cooling schedules [14] for inhomogeneous Markov chains. We obtain an upper bound of O(n^Γ) + (3/δ)^{2·Γ} transitions to find with probability 1 − δ a conjunction with ℓ literals that accepts at least one positive example and rejects all negative examples, where Γ ≤ ℓ is a specific parameter of the associated energy landscape. Although Γ = ℓ cannot be excluded in the worst case (i.e., the time complexity is then approximately the same as for Verbeurgt's method), the parameter Γ depends on the sample set S and can be significantly smaller than ℓ. Thus, further studies will be directed towards an analysis of the average Γ for randomly chosen S. The general framework has been previously applied by the authors to simulations of physical systems [4].

The algorithm has been implemented and tested on instances with different parameter settings. The number of test examples was the same as the number of training samples. A good classification rate could be achieved if the ratio of training examples per conjunction is sufficiently large. For example, a classification error of about 19% on positive and 29% on negative test examples was achieved for n = 64, ℓ = 10, and m = 16384 learning examples. A much higher classification rate can be expected on structured data. The problem size is derived from applications where objects have to be classified according to a number of features, e.g., as in computer-assisted medical diagnosis [20]. Future research will include applications to databases derived from real-world problems such as diabetes diagnostics.

The paper is organized as follows: In Section 2, we introduce the basic notations, including the main features of simulated annealing algorithms. In Section 3, we describe the simulated annealing procedure that computes with a certain confidence 1 − δ a term of length ℓ that satisfies a given positive example and rejects all negative examples. From this particular procedure, the overall algorithm approximating DNF is derived. Results from computational experiments are also reported in this section. Finally, the convergence and run-time analysis of our local search method is presented in Section 4.

2. Basic definitions

2.1. The approximation problem

We consider Boolean functions f of n variables that can be represented by disjunctive normal forms F containing at most n^{O(1)} conjunctive terms t = x_{i_1}^{σ_{j_1}} & x_{i_2}^{σ_{j_2}} & ··· & x_{i_{l(j)}}^{σ_{j_{l(j)}}}. The corresponding class of Boolean functions is denoted by F := {f : F(f) = ⋁_{i∈I} t_i, |I| ≤ s}, where s = n^{O(1)} and the t_i ≢ 0 are conjunctions of literals x^σ ∈ {x_1, ..., x_n, x̄_1, ..., x̄_n}; x^1 ≡ x and x^0 ≡ x̄. For the number l(t_i) of literals we assume 1 ≤ l(t_i) ≤ ℓ ≤ n. We denote by NEG, POS ⊂ {0, 1}^n the negative and positive examples of F, i.e., σ ∈ NEG ⇒ F(σ) = 0 and σ ∈ POS ⇒ F(σ) = 1.

The total number of literals in F is called the length (or complexity) of F; e.g., for s conjunctions each consisting of ℓ literals, the length is s · ℓ.

We suppose that information about f ∈ F is provided by NEG ∪ POS only, and we denote by m = |NEG| + |POS| the total number of examples. We note that elements of NEG ∪ POS are provided as complete assignments σ = σ_1 σ_2 ... σ_n, i.e., the structure of conjunctions cannot be recognized in advance from the given examples. To simplify notation, we assume |NEG| = |POS| = m/2.

Now, the approximation problem considered in the present paper can be formulated as follows: The problem instance is given by [NEG, POS] for an unknown f ∈ F, and the aim is to compute a hypothesis H, represented by a DNF, such that the error of H on NEG ∪ POS is "small" and, additionally, H is "significantly shorter" than the trivial representation by n · m/2 literals.

The approximation problem is related to algorithmic learning as well as to data compression: When the input instances σ ∈ {0, 1}^n are distributed uniformly, the error for a DNF with s conjunctions, each containing n literals, is smaller than ε if conjunctions of length ℓ = log₂(s/ε) are taken: For a single conjunction of this length (i.e., when one of the s conjunctions is reduced to ℓ literals), the number of misclassified assignments is upper bounded by 2^{n−ℓ} = 2^n · ε/s. Thus, for s conjunctions and the uniform distribution 2^{−n}, the overall probability of incorrect classification is bounded by ε.
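As a quick numerical illustration of this bound (our own check; the parameter values are illustrative and not taken from the paper's experiments), consider the following Python sketch:

    import math

    n, s, eps = 64, 128, 0.05            # illustrative values only

    # Shortest conjunction length that keeps the error below eps:
    ell = math.ceil(math.log2(s / eps))  # ell >= log2(s/eps), here 12

    # A conjunction reduced to ell literals misclassifies at most
    # 2^(n-ell) of the 2^n assignments, i.e., probability 2^(-ell):
    per_term = 2.0 ** (-ell)
    total = s * per_term                 # union bound over s terms
    print(ell, per_term, total, total <= eps)   # -> 12 ... True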

2.2. General structure of simulated annealing

Our local search method to solve the approximation problem is based on the general framework of simulated annealing, which was introduced in [11, 17] as a new approach to calculate approximate solutions of combinatorial optimization problems. Detailed information about this method and applications in different areas can be found in [2, 6, 10, 18, 21, 22, 26]. In the present paper, we consider simulated annealing procedures for the special type of logarithmic cooling schedules.

Simulated annealing algorithms act within a configuration space in accordance with a certain neighborhood structure, where the particular transitions between adjacent elements of the configuration space are controlled by the value of an objective function. Now we are going to introduce these notions for the basic step of our approximation algorithm, which is the search for a single conjunction of length ℓ that accepts at least one σ ∈ POS and rejects all η ∈ NEG.

2.2.1. Configuration space and objective function. We define the configuration space for each particular positive example σ ∈ POS, i.e., the local search method is described for a single term t = x_{i_1}^{σ_{j_1}} & x_{i_2}^{σ_{j_2}} & ··· & x_{i_{l(j)}}^{σ_{j_{l(j)}}} of the entire hypothesis H for the Boolean function f ∈ F. Thus, the underlying configuration space is defined by

C := {t : t(σ) = 1; ℓ ≤ l(t) ≤ n; ∀η (η ∈ NEG → t(η) = 0)}.   (1)

For shorter notation, we do not indicate that C is specifically related to σ. The number of terms t ∈ C is upper bounded by

∑_{l=ℓ}^{n} \binom{n}{l} < 2^n; i.e., |C| < 2^n.   (2)

For the definition of the neighborhood relation, we consider t, t′ ∈ C: The two terms t, t′ ≢ 0, t ≠ t′, are called neighboring if t′ can be generated from t by deleting a literal in t or by adding x_i^{σ_i} to t, where σ_i is the i-th position of σ. By definition, see (1), the length l(t) is lower bounded by ℓ. Therefore, l(t) = ℓ implies that only one of the (n − ℓ) remaining literals can be added. The set of neighbors of t is denoted by N_t, including t itself. Thus, we have

n − ℓ + 1 ≤ |N_t| ≤ n + 1.   (3)

The objective function is defined by the length of terms:

Z(t) := l(t).   (4)

From our definition of N_t it follows that

t′ ∈ N_t ∧ t′ ≠ t → Z(t′) ≠ Z(t).   (5)

This property of the neighborhood relation allows us to simplify notation in simulated annealing procedures; see (17).

The set of terms t ∈ C minimizing Z is denoted by C∗.
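To make the configuration space (1) and the neighborhood N_t concrete, the following Python sketch is our own illustration, not the authors' implementation; a term is modeled as a mapping from variable indices to required bits, and the names satisfies, admissible, neighbors, ell, sigma, and neg are our assumptions:

    def satisfies(term, example):
        """t(example) = 1 iff every literal of the term is matched."""
        return all(example[i] == b for i, b in term.items())

    def admissible(term, neg):
        """Membership test for C from (1): reject every negative example."""
        return all(not satisfies(term, eta) for eta in neg)

    def neighbors(term, sigma, ell):
        """Neighbors of `term` w.r.t. the fixed positive example `sigma`:
        delete one literal (only if l(t) > ell) or add a literal
        x_i^{sigma_i}, which keeps t(sigma) = 1."""
        result = []
        if len(term) > ell:                      # deletion allowed
            for i in term:
                t2 = dict(term); del t2[i]
                result.append(t2)
        for i, bit in enumerate(sigma):          # addition
            if i not in term:
                t2 = dict(term); t2[i] = bit
                result.append(t2)
        return result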

2.2.2. Transition probabilities and convergence results. We denote by G[t, t′] the probability of generating t′ ∈ N_t from t, and by A[t, t′] the probability of accepting t′ once it has been generated from t. The value of G[t, t′] depends on the set N_t, and we choose the uniform generation probability (we write N_t for |N_t|):

G[t, t′] := 1/N_t, if t′ ∈ N_t, and 0 otherwise.   (6)

From the above-mentioned upper bound (3) we have for t′ ∈ N_t

G[t, t′] ≥ 1/(n + 1).   (7)

As for G[t, t′], there are different possibilities for the choice of the acceptance probabilities A[t, t′]. A straightforward definition, related to the underlying analogy to thermodynamic systems, is the following:

A[t, t′] := 1, if Z(t′) − Z(t) ≤ 0, and e^{−(Z(t′)−Z(t))/c} otherwise,   (8)

where c is a control parameter having the interpretation of a temperature in annealing procedures. Thus, if Z(t′) − Z(t) ≤ 0, the transition to t′ is performed immediately. In case Z(t′) > Z(t), the decision whether or not t′ is accepted depends on the following trial: t′ is accepted if

e^{−(Z(t′)−Z(t))/c} ≥ ρ,   (9)

where ρ ∈ [0, 1] is a uniformly distributed random number. The value ρ is generated anew in each trial with Z(t′) > Z(t), i.e., the particular ρ are independent. Finally, the probability of performing the transition between t and t′ is defined by

Pr{t → t′} = G[t, t′] · A[t, t′], if t′ ≠ t, and 1 − ∑_{t′≠t} G[t, t′] · A[t, t′] otherwise.   (10)

By definition, the probability Pr{t → t′} depends on the control parameter c. Let a_t(k) denote the probability of being in the configuration t after k steps performed at the same value of c. The probability a_t(k) can be calculated in accordance with

a_t(k) := ∑_{t′} a_{t′}(k − 1) · Pr{t′ → t}.   (11)

The recursive application of (11) defines a Markov chain of probabilities a_t(k), where t ∈ C and k = 0, 1, 2, .... If the parameter c = c(k) is a constant, the chain is said to be a homogeneous Markov chain; otherwise, if c(k) is lowered at any step, the sequence of probability vectors a⃗(k) forms an inhomogeneous Markov chain.
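The following sketch combines the generation probability (6), the acceptance rule (8)/(9), and the logarithmic schedule c(k) = Γ/ln(k + 2) introduced in (12) below into a single inhomogeneous annealing step. It is our reading of the procedure, not the authors' code; it reuses the hypothetical helpers from the previous sketch, and the self-transition of (10) is realized here by rejected or inadmissible moves:

    import math, random

    def anneal_step(term, k, sigma, neg, ell, Gamma):
        """One transition at step k under c(k) = Gamma / ln(k + 2)."""
        c = Gamma / math.log(k + 2)
        cand_set = neighbors(term, sigma, ell) + [term]   # N_t includes t
        cand = random.choice(cand_set)                    # G[t,t'] = 1/N_t
        if cand is term or not admissible(cand, neg):
            return term                                   # stay at t
        dZ = len(cand) - len(term)                        # Z(t) = l(t)
        if dZ <= 0 or random.random() <= math.exp(-dZ / c):
            return cand                                   # acceptance (8)/(9)
        return term

For an uphill move (dZ = 1), the acceptance probability equals (k + 2)^{−1/Γ}, which is exactly the form used later in (18).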

Usually, homogeneous Markov chains are considered in applications since they tend to the Boltzmann distribution, cf. [1, 2]. For our DNF approximation problem, we want to take advantage of convergence properties that have been proved by Hajek [14] for inhomogeneous Markov chains. Hajek's theorem relates the convergence to optimum (minimum) configurations to a structural property of the configuration space.

We briefly recall Hajek's result on logarithmic cooling schedules: First, we need to introduce some parameters characterizing local minima.

Definition 1. A configuration t′ ∈ C is said to be reachable at height h from t ∈ C if ∃ t_0, t_1, ..., t_r ∈ C (t_0 = t ∧ t_r = t′) such that G[t_u, t_{u+1}] > 0, u = 0, 1, ..., (r − 1), and Z(t_u) ≤ h for all u = 0, 1, ..., r.

We use the notation height(t ⇒ t′) for this property. The configuration t is a local minimum if t ∈ C and Z(t′) ≥ Z(t) for all t′ ∈ N_t.

Definition 2. Let t_min denote a local minimum; then the depth depth(t_min) is the smallest h such that there exists t′ ∈ C with Z(t′) < Z(t_min) that is reachable at height Z(t_min) + h.

We will employ the following

Theorem 1 ([14]). Given a cooling schedule defined by

c(k) = Γ / ln(k + 2),  k = 0, 1, ...,   (12)

the asymptotic convergence ∑_{t∈C∗} a_t(k) → 1 as k → ∞ of simulated annealing, using (6), (8), and (10), is guaranteed if and only if

(i) ∀ t, t′ ∈ C ∃ t_0, t_1, ..., t_r ∈ C (t_0 = t ∧ t_r = t′): G[t_u, t_{u+1}] > 0, u = 0, 1, ..., (r − 1);
(ii) ∀ h: height(t ⇒ t′) ≤ h ⇐⇒ height(t′ ⇒ t) ≤ h;
(iii) Γ ≥ max_{t_min} depth(t_min).

In our case of conjunctive terms from C, the parameter Γ can be upper bounded in the following way: Since we consider f ∈ F where the conjunctions are of length at most ℓ, there exist literals x_{i_1}^{σ_{i_1}}, ..., x_{i_ℓ}^{σ_{i_ℓ}} such that the σ defining C is accepted by x_{i_1}^{σ_{i_1}} & ··· & x_{i_ℓ}^{σ_{i_ℓ}} and all elements of NEG are rejected by this conjunction. Thus, if t ∈ C does not contain these literals, we can simply add them to t, and, obviously, there exists even a greedy path from t & x_{i_1}^{σ_{i_1}} & ··· & x_{i_ℓ}^{σ_{i_ℓ}} to x_{i_1}^{σ_{i_1}} & ··· & x_{i_ℓ}^{σ_{i_ℓ}} ∈ C∗. Therefore, we have

Γ ≤ ℓ.   (13)

The actual value of Γ depends on the structure of POS and NEG, see the defining relation (1), but rough estimates of Γ from computational experiments usually produce values smaller than ℓ.

Let k_0 denote the maximum of the minimum number of transitions to reach an optimum solution starting from an arbitrary t ∈ C. In Section 4, we will prove the following complexity result:

Theorem 2. For the configuration space C from (1) and the set of global minima C∗ ⊂ C, the number of transitions

k > max{ 2 · n^Γ + (3/δ)^{2·Γ}, k_0 }

implies for arbitrary initial probability distributions a⃗(0) the relation

∑_{t∉C∗} a_t(k) < δ, and therefore ∑_{t′∈C∗} a_{t′}(k) ≥ 1 − δ,

where Γ is chosen according to Theorem 1.

The proof is based on the fact that the probabilities a_t(k) do not change significantly after the indicated number of transition steps.


3. The DNF approximation algorithm

We consider a straightforward implementation of logarithmic simulated annealing as described in the previous section. The algorithm works on the set of parameters PAR := [n, ℓ, k_δ, c(k)] and the input IN := [σ ∈ POS, NEG], where c(k) is from (12) for Γ = ℓ. The parameter k_δ denotes the number of trials and should ensure that the probability to reach C∗ is larger than (1 − δ). Lower bounds for k_δ are discussed in Section 4.

The procedure searching for t ∈ C∗ is denoted by term_approx(PAR, IN) and can be described by a simple program as shown in figure 1. When Γ is chosen according to (13), we can apply Hajek's theorem and obtain:

Corollary 1. The inhomogeneous Markov chain that is associated with term_approx(PAR, IN) tends to a probability distribution with lim_{k→∞} ∑_{t∈C∗} a_t(k) = 1.

Figure 1. The procedure term_approx(PAR, IN) (program text not reproduced here).
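Since the program text of figure 1 is not reproduced, the following Python sketch gives our reconstruction of term_approx from the surrounding description; in particular, the initialization with the full conjunction on σ (which accepts σ and rejects every other assignment, hence all of NEG) is an assumption:

    def term_approx(sigma, neg, ell, k_delta, Gamma):
        """Simulated annealing search for a short term t in C with
        t(sigma) = 1 and t(eta) = 0 for all eta in NEG."""
        term = {i: bit for i, bit in enumerate(sigma)}   # start at l(t) = n
        best = dict(term)
        for k in range(k_delta):
            term = anneal_step(term, k, sigma, neg, ell, Gamma)
            if len(term) < len(best):                    # track shortest term
                best = dict(term)
        return best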


In order to obtain a hypothesis H that satisfies all m/2 positive examples from POS and rejects all m/2 negative examples from NEG, the procedure term_approx(PAR, IN) has to be performed independently for all m/2 positive examples. The outcome might be the same for different σ ∈ POS, and it remains to collect all pairwise different conjunctive terms.
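A minimal sketch of this outer loop (our own illustration; deduplicating terms via frozen literal sets is an implementation choice, not prescribed by the paper):

    def approximate_dnf(pos, neg, ell, k_delta, Gamma):
        """Run term_approx for every positive example and keep the
        pairwise different terms; the hypothesis H is their disjunction."""
        terms = set()
        for sigma in pos:
            t = term_approx(sigma, neg, ell, k_delta, Gamma)
            terms.add(frozenset(t.items()))   # deduplicate equal terms
        return [dict(t) for t in terms]

    def evaluate(hypothesis, example):
        """H(example) = 1 iff some conjunctive term accepts the example."""
        return any(satisfies(t, example) for t in hypothesis)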

As can be seen from the program description in figure 1, the average complexity of performing one transition t → t′ is bounded by O(n · m). For each configuration space corresponding to a particular positive example (as defined in (1)), the number of transitions is estimated in Theorem 2. Thus, we obtain

Corollary 2. For |S| = m, the time complexity to obtain a DNF representation H of S whose length, with probability 1 − δ, is smaller than or equal to ℓ · m/2 is bounded from above by O(m² · n · (n^Γ + (3/δ)^{2·Γ})).

We implemented term_approx(PAR, IN) and performed computational experiments on randomly generated DNF. The implementation was running on a Sun Ultra 5/350 workstation. The target DNF F was generated randomly for given values of n, s, and ℓ. Then, for F and a given total number m of examples, an equal number of m/2 positive and m/2 negative examples was generated. We present the outcome of computational experiments for n = 1024 and s = 256. The value n = 1024 can be seen as a small fragment of size 32² of a 512² X-ray image, i.e., the number of variables is in the range of real applications.

The hypotheses were computed in accordance with term_approx, but no time limit k_δ was used, since in all cases we were able to approach even very short conjunctions x_{i_1}^{σ_{i_1}} & ··· & x_{i_ℓ}^{σ_{i_ℓ}} rejecting all elements of NEG.

For each hypothesis H, additional m examples were generated randomly from the corresponding F, all different from the examples used in the learning phase. Then H was evaluated on these sets of m/2 positive and m/2 negative examples, and the test procedure was repeated five times.
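The paper does not spell out its random generator, so the following sketch shows one plausible way to produce such instances; the sampling choices are our assumptions, and evaluate/satisfies are the hypothetical helpers from the earlier sketches:

    import random

    def random_target_dnf(n, s, ell, rng=random.Random(0)):
        """s random conjunctions, each with ell literals over n variables."""
        terms = []
        for _ in range(s):
            idx = rng.sample(range(n), ell)
            terms.append({i: rng.randint(0, 1) for i in idx})
        return terms

    def sample_examples(target, n, m_half, rng=random.Random(1)):
        """Draw uniform assignments and bucket them by the target's value
        until m_half positive and m_half negative examples are found."""
        pos, neg = [], []
        while len(pos) < m_half or len(neg) < m_half:
            x = tuple(rng.randint(0, 1) for _ in range(n))
            bucket = pos if evaluate(target, x) else neg
            if len(bucket) < m_half:
                bucket.append(x)
        return pos, neg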

As shown in Tables 1 and 2, the rate of correct classification increases significantly with a larger number of examples per conjunctive term of the target DNF F.

Even a marginally smaller ℓ causes a significant increase of the computation time, because the search for a shorter conjunction rejecting all elements of NEG becomes more difficult. However, the number of accepted positive examples increases for shorter conjunctions.

Table 1. Computational results for n = 64, s = 128, ℓ = 10.

|POS| + |NEG|   Examples per   Correct H(σ) for   Correct H(η) for   Number of    Average multiplicity   Run-time
                term of F      F(σ) = 1 (%)       F(η) = 0 (%)       terms in H   of terms               (seconds)
 4096           32             70.75              61.08                903        2.27                      2610
 8192           64             74.95              68.26               1288        3.18                     12287
16384          128             81.24              71.24               1854        4.41                     69474


Table 2. Computational results for n = 64, s = 128, ℓ = 9.

|POS| + |NEG|   Examples per   Correct H(σ) for   Correct H(η) for   Number of    Average multiplicity   Run-time
                term of F      F(σ) = 1 (%)       F(η) = 0 (%)       terms in H   of terms               (seconds)
 4096           32             74.17              65.04                672        3.04                      5462
 8192           64             81.57              69.36                944        4.34                     63354
16384          128             88.11              70.83               1393        5.88                    480974

To obtain a good classification rate for large values of n and s, a very large number of training examples is required (see Table 2). On the other hand, it remains open to investigate the performance of our algorithm on more specific data with a nonuniform generation probability, i.e., if the training data carry more information about the structure of the problem instance. In these cases, we expect a much higher classification rate. In particular, future research will include the application of our approximation algorithm to computer-assisted medical diagnosis.

4. Convergence analysis

Our convergence result is based on a careful analysis of the "exchange of probabilities" between configurations that belong to adjacent distance levels with respect to C∗. We introduce the following partition of the set of configurations with respect to the value of the objective function:

L_0 := C∗ and L_h := {t : t ∈ C ∧ l(t) = ℓ + h}, 1 ≤ h ≤ n − ℓ.   (14)

For any particular term t ∈ C, we introduce notations for the number of neighbors with a certain length. We recall that the definition of the neighborhood relation implies that N_t contains only one element of length l(t), namely the term t itself. Thus, we denote

s(t) := |{t′ : t′ ∈ N_t ∧ Z(t′) > Z(t)}|,   (15)
r(t) := |{t′ : t′ ∈ N_t ∧ Z(t′) < Z(t)}|.   (16)

Thus, from the definition of N_t we have

s(t) + r(t) = N_t − 1.   (17)

We consider the probability a_t(k) of being in the configuration t ∈ C after k transitions of an inhomogeneous Markov chain that is defined in accordance with (12). We observe that for our neighborhood relation (cf. the paragraph before (3)) we have |Z(t′) − Z(t)| = 1 for t′ ∈ N_t and t′ ≠ t, and therefore the acceptance probability (8) can be rewritten for γ := 1/Γ as

e^{−(Z(t′)−Z(t))/c(k)} = 1/(k + 2)^{1/Γ} = 1/(k + 2)^γ, k ≥ 0.   (18)

4.1. A parameterized expansion of probabilities

In (11), we separate the probabilities according to whether or not t′ equals t, and the probability to remain in t is substituted by the defining equation from (10). Thus, we obtain:

a_t(k) = ∑_{t′∈N_t} a_{t′}(k − 1) · Pr_{c(k)}{t′ → t}   (19)
 = a_t(k − 1) · Pr_{c(k)}{t → t} + ∑_{t′≠t} a_{t′}(k − 1) · Pr_{c(k)}{t′ → t}   (20)
 = a_t(k − 1) · (1 − ∑_{t′≠t} Pr_{c(k)}{t → t′}) + ∑_{t′≠t} a_{t′}(k − 1) · Pr_{c(k)}{t′ → t}.   (21)

The calculation of a_t(k) is now expressed in terms of the structural parameters defined in (15)–(17):

Lemma 1. The value of a_t(k) can be calculated from the probabilities of the previous step by

a_t(k) = ( (s(t) + 1)/N_t − ∑_{i=1}^{s(t)} (k + 1)^{−γ}/N_t ) · a_t(k − 1) + ∑_{i=1}^{s(t)} a_{t_i}(k − 1)/N_{t_i} + ∑_{j=1}^{r(t)} a_{t_j}(k − 1)/(N_{t_j} · (k + 1)^γ).

Proof: The second factor in the first summand of (21) is substituted according to

(s(t) + r(t) + 1)/N_t − ∑_{i=1}^{s(t)} (k + 1)^{−γ}/N_t − ∑_{j=1}^{r(t)} 1/N_t = 1 − ∑_{t′≠t} Pr_{c(k)}{t → t′},

where we employ (17) and (8), and we apply ∑_{j=1}^{r(t)} 1/N_t = r(t)/N_t.

Finally, the second summand of (21) is written out explicitly according to (6) and (8). Here, we have to take into account that the s(t) neighbors perform a transition to t, which has a smaller value of the objective function, and the r(t) neighbors perform a transition to t with a larger value of the objective function. The generation probabilities are determined by 1/N_{t_i} and 1/N_{t_j}, respectively.
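As a consistency check of Lemma 1, the coefficient of a_t(k − 1) can be compared symbolically with the self-transition probability from (10); the following sympy snippet (our own check, not part of the paper) confirms that the two expressions coincide exactly when relation (17) holds:

    import sympy as sp

    s, r, N, psi = sp.symbols('s r N psi', positive=True)

    # Self-transition from (10): the r(t) downhill moves are always
    # accepted, the s(t) uphill moves with probability psi = (k+1)^(-gamma).
    stay = 1 - (r + s * psi) / N

    # Coefficient of a_t(k-1) in Lemma 1:
    lemma = (s + 1) / N - s * psi / N

    # Equal whenever s + r + 1 = N, i.e., relation (17):
    print(sp.simplify(stay.subs(N, s + r + 1) - lemma.subs(N, s + r + 1)))  # 0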

The representation (expansion) from Lemma 1 will be used in the following as the main relation reducing a_t(k) to probabilities from previous steps.

Besides taking into account the value of the objective function by the classes L_h defined in (14), the elements of the configuration space are additionally distinguished by their minimum distance to C∗: Given t ∈ C, we consider a shortest path of length dist(t) with respect to neighborhood transitions from t to C∗. We introduce a partition of C in accordance with dist(t):

t ∈ M_i ⇔ dist(t) = i ≥ 0, and M = ⋃_{i=0}^{k_0} M_i,   (22)

where M_0 := L_0 = C∗ and k_0 is the maximum distance. Thus, we distinguish between the distance levels M_i, related to the minimum number of transitions required to reach an element of C∗, and the levels L_h.

Since we want to analyze the convergence to elements from M_0 = L_0 = C∗, we have to show that the value

∑_{t∉M_0} a_t(k)   (23)

becomes small for large k. We suppose that k ≥ k_0, and we go backwards from the k-th step.

We consider the expansion of a particular probability a_t(k) as shown in Lemma 1. At the same step k → (k − 1), the neighbors of t generate terms containing a_t(k − 1) as a factor, in the same way as a_t(k) generates terms with the factors a_{t_i}(k − 1) and a_{t_j}(k − 1) in Lemma 1. If we consider the entire sum ∑_{t∉M_0} a_t(k), then the terms corresponding to a particular a_t(k − 1) can be collected together to form a single term.

Firstly, we consider t ∈ M_i, i ≥ 2. In this case, t does not have neighbors from M_0, i.e., the expansion from Lemma 1 appears for all neighbors of t in the reduction of ∑_{t∉M_0} a_t(k) to step (k − 1). Therefore, taking together all terms that contain a_t(k − 1), we obtain

a_t(k − 1) · { ( (N_t − r(t))/N_t − ∑_{i=1}^{s(t)} 1/(N_t · (k + 1)^γ) ) + ∑_{i=1}^{s(t)} 1/(N_t · (k + 1)^γ) + ∑_{j=1}^{r(t)} 1/N_t } = a_t(k − 1).   (24)

Secondly, if t ∈ M_1, the neighbors from M_0 are missing in ∑_{t∉M_0} a_t(k) at the step to (k − 1), i.e., they do not generate terms containing probabilities from higher levels. For t′ ∈ M_0, the expansion from Lemma 1 contains the terms a_{t_i}(k − 1)/N_{t_i} for t_i ∈ M_1 (and there are no terms for t_j with a smaller value of the objective function, since t′ ∈ M_0). Thus, the terms a_{t_i}(k − 1)/N_{t_i} are not available for t = t_i ∈ M_1 in the reduction of ∑_{t∉M_0} a_t(k) to step (k − 1) when one tries to establish a relation like (24) for elements of M_1. For each t ∈ M_1, there are r(t) such terms related to neighbors from M_0; see (16). Therefore, in the expansion of ∑_{t∉M_0} a_t(k), the following arithmetic term is generated when the particular t is from M_1:

(1 − r(t)/N_t) · a_t(k − 1).   (25)

We introduce the following abbreviations:

ϕ(t, v) := (k + 2 − v)^{−γ}/N_t,  D_t(k − v) := (s(t) + 1)/N_t − s(t) · ϕ(t, v).   (26)

Now, the relations expressed in (24) and (25) can be summarized to

Lemma 2. A single step of the expansion of ∑_{t∉M_0} a_t(k) results in

∑_{t∉M_0} a_t(k) = ∑_{t∉M_0} a_t(k − 1) − ∑_{t∈M_1} (r(t)/N_t) · a_t(k − 1) + ∑_{t′∈M_0} ∑_{i=1}^{s(t′)} ϕ(t′, 1) · a_{t_i}(k − 1).

Proof: The third summand results from the fact that the terms generated according to Lemma 1 for elements t′ ∈ M_0 at step (k − 1) cannot be taken together as for t ∉ (M_0 ∪ M_1), since for these t′ the probabilities are not expanded in ∑_{t∉M_0} a_t(k), i.e., we have t′ ∈ M_0 ∩ N_t. Thus, we obtain from (24) and (25):

∑_{t∉M_0} a_t(k) = ∑_{t∉(M_0∪M_1)} a_t(k) + ∑_{t∈M_1} a_t(k)
 = ∑_{t∉(M_0∪M_1)} a_t(k − 1) + ∑_{t∈M_1} (1 − r(t)/N_t) · a_t(k − 1) + ∑_{t′∈M_0} ∑_{i=1}^{s(t′)} ϕ(t′, 1) · a_{t_i}(k − 1).

The sum containing ϕ(t′, 1) is generated from a_{t_i}(k), t_i ∈ M_1. The negative product a_t(k − 1) · r(t)/N_t is taken separately, and one obtains the stated equation.

The diminishing factor (1 − r(t)/N_t) appears by definition for all elements of M_1. At subsequent reduction steps, the factor is "transmitted" successively to all probabilities from higher distance levels M_i, because any element of M_i has at least one neighbor from M_{i−1}. The main task is now to analyze how this diminishing factor changes when it is transmitted to higher distance levels. We denote

∑_{t∉M_0} a_t(k) = ∑_{t∉M_0} µ(t, v) · a_t(k − v) + ∑_{t′∈M_0} µ(t′, v) · a_{t′}(k − v),   (27)

i.e., the coefficients µ(t, v) are the factors at the probabilities after v steps of an expansion of ∑_{t∉M_0} a_t(k).

For example, for t ∈ M_1 we have µ(t, 1) = 1 − r(t)/N_t (see (25)), and µ(t, 1) = 1 for the remaining t ∈ M \ (M_0 ∪ M_1). For t′ ∈ M_0, Lemma 2 implies µ(t′, 1) = ∑_{i=1}^{s(t′)} ϕ(t′, 1).

Starting from step (k − 1), the probabilities a_{t′}(k − v), t′ ∈ M_0, from (27) are expanded in the same way as the probabilities for all other t ∉ M_0.

We establish a recursive relation for the coefficients µ(t, v) defined in (27), where we apply the same expansion that resulted in Eq. (24) to the products µ(t, v) · a_t(k − v). The recursive relation is derived by an inductive step from (k − (v − 1)) to (k − v), v ≥ 2. The probabilities a_t(k − (v − 1)) are expanded in

∑_{t∉M_0} a_t(k) = ∑_{t∈M} µ(t, v − 1) · a_t(k − (v − 1)), v ≥ 2,

according to Lemma 1. The only difference to the steps that led to (24) and (25) is the presence of terms generated by t′ = t ∈ M_0. We note that the particular summands in the expansion of a_t(k − (v − 1)), i.e., the summands on the right-hand side in Lemma 1, are multiplied by the corresponding µ(t, v − 1). We recall that the sum ∑_{t∉M_0} a_t(k) is considered, and taking together all terms associated with a particular a_t(k − v), we have:

a_t(k − v) · { µ(t, v − 1) · ( (N_t − r(t))/N_t − s(t) · ϕ(t, v) )   (28)
 + ∑_{t_i>t} µ(t_i, v − 1) · ϕ(t, v) + ∑_{t_j<t} µ(t_j, v − 1)/N_t }   (29)
 = a_t(k − v) · µ(t, v).   (30)

Here, we write for neighboring elements t′ < t if Z(t′) < Z(t), and t′ > t for the reverse relation of the objective function. Thus, taking into account s(t) + 1 = N_t − r(t) and (26), we obtain the following parameterized representation:

Lemma 3. The following recurrent relation is valid for the coefficients µ(t, v):

µ(t, v) = µ(t, v − 1) · D_t(k − v) + ∑_{t′<t} µ(t′, v − 1)/N_{t′} + ∑_{t′′>t} µ(t′′, v − 1) · ϕ(t, v).   (31)

Furthermore,

µ(t, v) = 0, if t ∈ M_i, i > v;
µ(t, v) = 1 − r(t)/N_t, if t ∈ M_1, v = 1;
µ(t, v) = s(t) · ϕ(t, 1), if t ∈ M_0, v = 1.

It is important to note that the summands are divided by the same value N_t (see (26)).

Remark 1. To simplify notation, the partial sums in (31) are not subdivided according to the distance sets M_i to which the particular t′ and t′′ belong.


4.2. Reduction of parameters to elementary expressions

In our convergence analysis, we focus on the speed of convergence, i.e., we try to estimate the difference between µ(t, k_1) and µ(t, k_2) when the expansion starts at k_1 and k_2, respectively. Thus, we do not attempt to estimate the absolute value of µ(t, v).

The following simple transformation allows us to find a closed-form formula for an upper bound of |µ(t, k_1) − µ(t, k_2)|: For t ∉ M_0 we consider ν(t, v) = 1 − µ(t, v) instead of µ(t, v) itself; for elements t′ from M_0 we take the original value µ(t′, v). As will be seen later, this transformation allows us to identify the "generating elements" of ν(t, v) and µ(t′, v).

When µ(t, v) is substituted in (31) by 1 − ν(t, v), we obtain the same relation for ν(t, v), because the sum of transition probabilities equals 1 within the neighborhood N_t.

We consider in more detail the expressions associated with elements of M_0 and M_1. We assume representations µ(t′, v − 1) = ∑_{u′} T′_{u′} and ν(t, v − 1) = ∑_u T_u, where the T′_{u′} and T_u are products that have been generated at previous steps from the elementary expressions listed in Lemma 3 for v = 1. Since there are no t′′ < t′ for t′ ∈ M_0, we obtain:

µ(t′, v) = D_{t′}(k − v) · ∑_{u′} T′_{u′}(t′) + ∑_{t>t′} (1 − ∑_u T_u(t)) · ϕ(t′, v)   (32)
 = s(t′) · ϕ(t′, v) + D_{t′}(k − v) · ∑_{u′} T′_{u′}(t′) − ∑_{t>t′} ∑_u T_u(t) · ϕ(t′, v).   (33)

In (32) and (33), we consider an arbitrary time step v > 1, and therefore we have established that the expression s(t′) · ϕ(t′, v) is generated for each v > 1 with the corresponding ϕ(t′, v). When (31) is written for ν(t, v), we obtain in the same way for elements of M_1:

ν(t, v) = r(t)/N_t + D_t(k − v) · ν(t, v − 1) + ∑_{t′′>t} ν(t′′, v − 1) · ϕ(t, v) − ∑_{t′<t} ∑_{u′} T′_{u′}(t′)/N_t,   (34)

where r(t)/N_t results from ∑_{t′<t} 1/N_t. The expression r(t)/N_t appears in all recursive equations of ν(t, v), t ∈ M_1 and v ≥ 1, and the same is valid for the value s(t′) · ϕ(t′, v) in all µ(t′, v), t′ ∈ M_0. Therefore, all products T are derived from expressions of the type r(t)/N_t and s(t′) · ϕ(t′, v).

and s(t ′) · ϕ(t ′, v).We note that once a product T (i.e., a summand in (31)) has been generated by the

previous value of ν(t, v − 1) or “neighboring” values ν(t ′, v − 1), this product is multipliedin the recursive relation (31) by factors that depend

either on structural properties of the neighborhood relation (that is the summand (s(t)+1)/Nt

in Dt ),or on differences of the objective function as expressed by ϕ(t, v).

We note that during the recursive calculation of values ν(t, v), the only parameter thatchanges in ϕ(t, v) is v. Thus, we try to keep track for each individual product T that isgenerated by a recursive step as given in (31). For this purpose, the coefficients ν(t, v) are

68 ALBRECHT AND WONG

represented by a sum∑

i Ti of products (as in the derivation of (32), . . . ,(34)), and we arenow going to define the products Ti formally by an inductive procedure:

Definition 3. The expressions r(t)/N_t (the first summand in (34)), t ∈ M_1, and s(t′) · ϕ(t′, v) (the first summand in (33)), t′ ∈ M_0, are called source expressions of ν(t, v) and µ(t′, v), respectively, where v ≥ 1.

During an expansion of ∑_{t∉M_0} a_t(k) backwards according to (27), the source expressions are distributed permanently to higher distance levels M_j as well as to elements from M_0. That means, in the same way as for M_1, the calculation of ν(t, v) (µ(t′, v) for M_0) is repeated almost identically at any step; only the "history" of generations becomes longer.

We introduce a counter r(T) that indicates the step at which the expression (product) T has been generated from source expressions. The value r(T) is called the rank of T, and we set r(T) = 1 for source expressions T from Definition 3.

Basically, the rank r(T) ≥ 1 indicates the number of factors when T is represented by the subsequent multiplications according to the recurrent generation rules (33) and (34). Thus, we assume representations ν(t, u) = ∑_i T_i and µ(t′, v) = ∑_j T_j for 0 ≤ u ≤ v and t ∈ M \ M_0, t′ ∈ M_0, where the T_i, T_j are positive or negative products of source expressions.

Definition 4. For t′ ∈ M_0, the rank j is assigned to an expression T′ of µ(t′, v), where v, j ≥ 2, if either T′ = −T · ϕ(t′, v) and r(T) = j − 1 for ν(t, v − 1), t ∈ M_1 ∩ N_{t′}, or T′ = T · D_{t′}(k − v) and r(T) = j − 1 for µ(t′, v − 1).

The assignment r(T′) = j is a direct consequence of (33). Now, for t ∈ M_1 we have to consider ν(t, v) instead of µ(t′, v):

Definition 5. The rank j is assigned to an expression T of ν(t, v), t ∈ M_1 and v, j ≥ 2, if T = T̄ · ϕ(t, v) and r(T̄) = j − 1 for some ν(t̄, v − 1), t̄ ∈ M_2 ∩ N_t, or T = T̄ · D_t(k − v) and r(T̄) = j − 1 for ν(t, v − 1), or T = −T′/N_t and r(T′) = j − 1 for some µ(t′, v − 1), t′ ∈ M_0 ∩ N_t.

The three cases correspond to the second, third, and fourth summand in (34); the first summand is a source expression generated at this step. Finally, we consider the case t ∈ M_i, i ≥ 2:

Definition 6. The rank j is assigned to an expression T of ν(t, v), t ∈ M_i and v, i, j ≥ 2, if T = T̄ · F and r(T̄) = j − 1 for some ν(t̄, v − 1), t̄ ∈ M_{i+1} ∩ N_t (the factor F depends on the L_h to which t̄ ∈ M_{i+1} belongs, cf. Remark 1), or T = T̄ · D_t(k − v) and r(T̄) = j − 1 for ν(t, v − 1), or T = T̄ · F and r(T̄) = j − 1 for some ν(t̄, v − 1), t̄ ∈ M_{i−1} (the factor F depends on the L_h to which t̄ ∈ M_{i−1} belongs).

Let T_j(t, v) be the set of products T of ν(t, v) with the same rank r(T) = j, where t ∉ M_0. We set

S_j(t, v) := ∑_{T∈T_j(t,v)} T.   (35)

The same notation is used in the case of t′ = t ∈ M_0 with respect to µ(t′, v). Now, the coefficients ν(t, v) and µ(t′, v) can be represented by

ν(t, v) = ∑_{j=1}^{v} S_j(t, v) and µ(t′, v) = ∑_{j=1}^{v} S_j(t′, v).   (36)

We compare the computation of ν(t, v) and µ(t′, v) for two different values v = k_1 and v = k_2, i.e., ν(t, v) and µ(t′, v) are calculated backwards from k_1 and k_2, respectively. Let S^1_j and S^2_j denote the corresponding sums of expressions from (35) related to the two different starting steps k_1 and k_2. From Definition 3 we see that the source expression r(t)/N_t does not depend on k. For the second type of source expressions, we employ the simple equation k_2 − (k_2 − k_1 + v) = k_1 − v.

Lemma 4. Given k_2 ≥ k_1 and 1 ≤ j ≤ k_1, then for each t ∈ M:

S^1_j(t, v) = S^2_j(t, k_2 − k_1 + v).

Proof: We consider (27) for k_2 and v′ = k_2 − k_1 + v. The coefficients are µ^2(t, k_2 − k_1 + v) and µ^2(t′, k_2 − k_1 + v). For k_1 and v, the coefficients from (27) are simply µ^1(t, v) and µ^1(t′, v). We prove the proposition by induction over j.

If j = 1, then we obtain from (34) for k_1, t ∈ M_1, and ν^1(t, v) = 1 − µ^1(t, v) the relation S^1_1(t, v) = r(t)/N_t. Since this source expression is independent of v for t ∈ M_1, cf. Definition 3, we have S^2_1(t, k_2 − k_1 + v) = r(t)/N_t. At higher distance levels M_i, i > 1, we have by Definition 3 until Definition 6 the relation S^1_1(t, v) = S^2_1(t, k_2 − k_1 + v) = 0.

Thus, for j = 1 it remains to consider t′ ∈ M_0. For k_1, we obtain from (34) and Definition 3 the relation S^1_1(t′, v) = s(t′) · ϕ(t′, v), and for k_2 we have S^2_1(t′, k_2 − k_1 + v) = s(t′) · ϕ(t′, k_2 − k_1 + v). But (26) implies

ϕ(t′, v) = (k_1 + 2 − v)^{−γ}/N_{t′} = (k_2 + 2 − (k_2 − k_1 + v))^{−γ}/N_{t′} = ϕ(t′, k_2 − k_1 + v),

and therefore S^1_1(t′, v) = S^2_1(t′, k_2 − k_1 + v), t′ ∈ M_0.

If j > 1, then we obtain S^1_j(t, v) and S^2_j(t, k_2 − k_1 + v) from the source values S^1_1(t, v − j + 1) = S^2_1(t, k_2 − k_1 + v − j + 1), which are multiplied by the same factors because of ϕ(t, v) = ϕ(t, k_2 − k_1 + v); cf. Definition 4, ..., Definition 6.

Remark 2. If Lemma 4 is applied to v = k_1, then S^1_j(t, k_1) can be substituted by S^2_j(t, k_2).

We recall that our main goal is to upper bound the sum ∑_{t∉M_0} a_t(k). When a⃗(0) denotes the initial probability distribution, we have from (27):

|∑_{t∉M_0} (a_t(k_2) − a_t(k_1))| ≤ ∑_{t∉M_0} |(ν(t, k_2) − ν(t, k_1)) · a_t(0)| + ∑_{t′∈M_0} |(µ(t′, k_2) − µ(t′, k_1)) · a_{t′}(0)|.

Any element of the initial distribution a⃗(0) is simply replaced by 1, and we obtain for the first part of the sum, in accordance with Lemma 4 and Remark 2:

∑_{t∉M_0} |ν(t, k_2) − ν(t, k_1)| = ∑_{t∉M_0} |∑_{j=1}^{k_2} S^2_j(t, k_2) − ∑_{j=1}^{k_1} S^1_j(t, k_1)|
 = ∑_{t∉M_0} |∑_{j=1}^{k_2} S^2_j(t, k_2) − ∑_{j=1}^{k_1} S^2_j(t, k_2)|
 = ∑_{t∉M_0} |∑_{j=k_1+1}^{k_2} S^2_j(t, k_2)|
 ≤ ∑_{t∉M_0} ∑_{j=k_1+1}^{k_2} |S^2_j(t, k_2)|.

Therefore, we have

∑_{t∉M_0} |(ν(t, k_2) − ν(t, k_1)) · a_t(0)| ≤ ∑_{t∉M_0} ∑_{j=k_1+1}^{k_2} |S^2_j(t, k_2)|.   (37)

The same applies to t′ ∈ M_0. Now, we are going to find upper bounds for the values |S_j(t, k)|.

4.3. Upper bounds for elementary expressions

The recursive application of (31) generates negative summands in the representation of the values S_j(t, v), as can be seen from Definitions 4 and 5 (cf. also (33) and (34)). We set S_j(t, v) = S^+_j(t, v) − S^−_j(t, v) and S_j(t′, v) = S^+_j(t′, v) − S^−_j(t′, v) for t ∈ M_1 and t′ ∈ M_0, where the partial sums consist of positive products only.

When S_j(t, v), t ∈ M_1, and S_j(t′, v), t′ ∈ M_0, are calculated, the negative products of S_{j−1}(t, v − 1) become positive for S_j(t′, v), and the negative products of S_{j−1}(t′, v − 1) become positive for S_j(t, v); see (33) and (34). The negative products of S_{j−1}(t, v − 1) remain negative in the calculation of S_j(t, v), t ∈ M_2, and the same applies to higher distance levels. Hence, the negative and positive products can be considered separately at all distance levels.

Thus, we concentrate on upper bounds for S^+_j(t, v) only. To simplify notation, we write S_j(t, v) instead of S^+_j(t, v).

We investigate the structure of the products T that represent the values S_j(t, v), t ∈ M_i, when Definition 4, ..., Definition 6 are applied recursively:

S_j(t, v) := ∑_{T∈T_j(t,v)} T, t ∈ M_i,   (38)

see (35). When necessary, we emphasize the last factor(s) of T that were generated by the recursion (31). We distinguish the products arising from the recursive procedure by the numbers of six different types of factors:

T = Φ_1^{m_1} · Φ_2^{m_2} ··· Φ_6^{m_6},   (39)

where w(m_1) + w(m_2) + ··· + w(m_6) = j for weights w(m_l), l = 1, 2, ..., 6. The tuple [m_1, ..., m_6] is called the signature of T. A weight of 2 results from the fact that some Φ_l represent the product of two source expressions. We do not keep the notation by Φ_l in order to maintain the relation to the notations introduced earlier.

1. The first type Φ_1 consists of the factors D_t(k − v) from self-transitions, see (26). For simplicity, we use D = D_t(k − v), since we are at the moment interested in the number of factors only. The dependence on (k − v) is taken into account again when the value of D is important rather than the number of factors D, as in (45), (48), (63), and in the Appendix. Since D_t(k − v) is a single factor in the recurrent procedure (see (28), ..., (30)), we have w(D) = 1.

2. The factors of the second type Φ_2 are denoted by E. These factors are related to the "exchange" of expressions between neighboring levels M_{i±1}. More formally, the factors E arising in the products T from (38) are defined in the following inductive way:

(a) The source expressions from Definition 3, i.e., the values S_1(t, v), t ∈ M_1, and S_1(t′, v), t′ ∈ M_0, are both represented by the formal factor E. Obviously, there is no factor D at this first step.

(b) For the inductive step j → j + 1, we take the products T′_{i±1} that are summands in the representation (38) of S_j(t̄, v), t̄ ∈ M_{i±1}, with the same signature (i.e., T′_{i−1} for S_j(t̄, v), t̄ ∈ M_{i−1}, and T′_{i+1} for S_j(t̄, v), t̄ ∈ M_{i+1}). We set

T_{i+1} = max_{T′_{i+1} from M_{i+1}∩N_t} T′_{i+1};  T_{i−1} = max_{T′_{i−1} from M_{i−1}∩N_t} T′_{i−1}.

According to (28), ..., (31) (see also Definition 4, ..., Definition 6), we obtain at the step j → j + 1 the products

T̄_{i+1} = T_{i+1} · ( r_{i+1}(t)/N_t + s_{i+1}(t) · ϕ(t, v) );   (40)
T̄_{i−1} = T_{i−1} · ( r_{i−1}(t)/N_t + s_{i−1}(t) · ϕ(t, v) )   (41)

as summands for an upper bound of S_{j+1}(t, v + 1), t ∈ M_i. Here, r_{i+1}(t), ..., s_{i−1}(t) indicate the numbers of neighbors t̄ from the distance levels M_{i±1} with the corresponding value of the objective function Z(t̄) with respect to Z(t). Now, the value

r_{i+1}(t)/N_t + s_{i+1}(t) · ϕ(t, v) + r_{i−1}(t)/N_t + s_{i−1}(t) · ϕ(t, v) = r(t)/N_t + s(t) · ϕ(t, v)   (42)

is represented by the formal factor E. We note that except for t itself all neighbors of t are taken into account by this step, and the normalization is given by N_t; see also Remark 1.

We recall that we are interested in an upper bound for |S^2_j(t, k_2)|, see (37). Since we consider only products having the same sign (i.e., S^+_j(t, v)), we can change (38) to upper bounds:

S_j(t, v) ≤ ∑_{T∈T_j(t,v)} T̄.   (43)

Thus, at step (j + 1) we use for t ∈ M_i the following upper bound:

T̄_{i+1} + T̄_{i−1} ≤ max{T_{i+1}, T_{i−1}} · E,   (44)

see (40) and (41). We note that we still use E when the part either from M_{i−1} or from M_{i+1} is missing due to the recursive definition at the first steps of the expansion. As for D, we have w(E) = 1.

Example. In figure 2, we give examples of how the factors D and E emerge during the first three steps of the calculation of S_1 to S_3 at the levels M_0 to M_3 (negative expressions are included for M_0 only): The column for S_1 follows from Lemma 3 and Definition 3. For S_2 and M_1, there is no "contribution" from M_2, and we have only a negative term from M_0, which is omitted. Within M_1, the term S_1 = E is multiplied by D. For S_2 and M_0, the negative term is included as indicated above. At M_3, it is not excluded that S_3 is calculated by a step downwards to a lower L_h, e.g., for a particular term t ∈ M_3 the upper bound for S_3(t, 3) might be E · E · E, where the third E is related to ϕ(t, 3) only. The expression E · E · E of S_3 at M_1 is calculated from −E · E of M_0 and E · E of M_2 according to (42).

Figure 2. (Table of the expressions S_1, S_2, S_3 at the levels M_0, ..., M_3; diagram not reproduced here.)

The factor E is in some sense the complement of D: During the successive generation of the upper bounds, the following product of factors E and D is produced:

( r(t)/N_t + s(t) · ϕ(t, v) ) · ( (s(t) + 1)/N_t − s(t) · ϕ(t, v) );   (45)

see (26) and (42). We note that the order of factors is important, because for E · D the factor D from self-transitions is indeed related to r(t) and s(t) from E, since E was generated at the previous step for t ∈ M_i.

3. We recall that s(t) + 1 + r(t) = N_t (see (17)), and for upper bounds of S_j(M_i, v) we have to maximize

( X/N + Y ) · ( (N − X)/N − Y ).   (46)

We set ψ := (k + 2 − v)^γ, and by straightforward calculations we obtain for (45) the upper bound

E · D ≤ 1/4 − 1/(4 · N²) + 1/(2 · N² · ψ) − 1/(4 · N · ψ · (ψ − 1)) < 1/4 − (1/4) · (1/α²(n)),   (47)

where α(n) is a sufficiently large function, e.g., α(n) = n^{3/2}.

We introduce the third type Φ_3 of factors F_0 by setting

F_0 := (1/4) · (1 − 1/α²(n)).   (48)

Thus, from (47) we have

E · D < F_0 and w(F_0) = 2.   (49)

For the recursive steps we need one modification of F_0, which is the fourth factor type, defined by

F_1 := 2 · F_0, where w(F_1) = 2.   (50)
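The essential observation behind (47) is that the two factors of (46) sum to 1 by (17), so their product cannot exceed 1/4. The following quick numeric check (our own, over arbitrary admissible values of r, s, N, and ψ) illustrates this:

    # E = r/N + s*phi and D = (s+1)/N - s*phi with phi = 1/(N*psi);
    # E + D = (r + s + 1)/N = 1 by (17), hence E*D <= 1/4.
    for N in (4, 16, 64):
        for s in range(N):                 # r = N - 1 - s by (17)
            r = N - 1 - s
            for psi in (1.0, 2.0, 10.0):
                phi = 1.0 / (N * psi)
                E = r / N + s * phi
                D = (s + 1) / N - s * phi
                assert abs(E + D - 1.0) < 1e-12
                assert E * D <= 0.25 + 1e-12
    print("E*D <= 1/4 on all sampled cases")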

4. The fifth type of factors Φ_5 is denoted by G and is equal to F_1 · (D + D̄)/2 for two representatives D = D_t(k − v) and D̄ = D_{t̄}(k − v). Since w(D) = 1 and w(F_1) = 2, we have w(G) = 3.

5. The sixth type of factors Φ_6 is denoted by H and employs D + E = 1 for D and E from the same t ∈ M_i. The equation D + E = 1 follows from the definition of D and E in the same way as (24), taking into account (17). For this factor we have w(H) = 1, as will be seen later.

In order to define H, we distinguish in upper bounds of S_j(t, v), t ∈ M_i, between two types of products T′, which are now considered separately in more detail.

As can be seen from (28), ..., (30), and (31), there are two partial steps when the new products T are calculated during the transition from (j − 1) to j: The first partial step is to calculate the products T by multiplying them with D_t(k − v). This relates to the value of S_{j−1}(t, v − 1), t ∈ M_i, from step (j − 1); see the first summand of the right-hand side of (31) (we note that S_{j−1}(t, v − 1) is part of µ(t, v − 1), respectively ν(t, v − 1), as expressed in (36)). The second partial step relates to products T from step (j − 1) being part of S_{j−1}(t̄, v − 1) for neighboring elements t̄ ∈ N_t \ {t}, i.e., t̄ ∈ (M_{i+1} ∪ M_{i−1}) ∩ N_t. These products become part of an upper bound of S_j(t, v) as shown in (40), ..., (44).

Definition 7. A product T from an upper bound (43) of S_j(t, v), t ∈ M_i, is called a Type I product iff it has been generated from products of the preceding value S_{j−1}(t, v − 1) by multiplication with D_t(k − v).

A product T is called a Type II product iff T is the result of the partial steps (40), ..., (44) and therefore can be represented as T = T′ · E.

Since we consider upper bounds of S_j(M_i, v), we can further simplify the relation (43): By T^i_{I/II}(m) we denote the product that is calculated as the representative of the signature m for the corresponding type. Since all Type II products end with the factor E (see Definition 7), it is required that the representative T^i_{II}(m) has the last factor E.

Remark 3. The products T^i_{I/II}(m) are not simply based on a max-operation: They are calculated by specific procedures for each of the two Types I and II, where the signatures m′ of the products from the preceding step, involved in the calculation, are crucial.

By A^i_{I/II}(m) we denote the coefficient of the representative T^i_{I/II}(m), which at the first step j = 1, and maybe for some further steps, is simply the number of related products. The calculation of representatives has to ensure that

S_j(t, v) ≤ ∑_m ( A^i_I(m) · T^i_I(m) + A^i_{II}(m) · T^i_{II}(m) ).   (51)

Now, we are going to describe the calculation of the representatives T^i_{I/II}(m):

Type I Representatives:

By definition, during the step (j − 1) → j, (at most) two products T^i_I(m_1) · D and T^i_{II}(m_2) · D are generated, where T^i_{II}(m_2) = T̄^i_{II}(m′_2) · E and T̄^i_{II}(m′_2) is determined by the max-operation on the right-hand side of (44) (which does not change if representatives from M_{i±1} of the same signature are considered). To calculate the representative T^i_I(m) at step j for m = [d, e, f_0, f_1, g, h], we choose m_1 = [d, e, f_0 + 1, f_1, g − 1, h] and m_2 = [d + 1, e + 1, f_0, f_1, g − 1, h]. Since j = d + e + 2 · f_0 + 2 · f_1 + 3 · g + h, we have d + e + 2 · (f_0 + 1) + 2 · f_1 + 3 · (g − 1) + h = j − 1 and (d + 1) + (e + 1) + 2 · f_0 + 2 · f_1 + 3 · (g − 1) + h = j − 1. Both signatures m_1 and m_2 are strictly determined in their relation to m, and therefore the representatives T^i_I(m_1) and T^i_{II}(m_2) are uniquely defined by ∑_l w(m_l) = j − 1.

1. For m, there exist both representatives, for m_1 and for m_2. One of the (f_0 + 1) factors F_0 in T^i_I(m_1) is changed to (1/2) · F_1 (see (48) and (50)). We obtain the first summand for the representative: T′^i_I(m′_1) · ((1/2) · F_1 · D).

In T̄^i_{II}(m′_2) · E · D, the product E · D is substituted by (1/2) · F_1; see (48) and (50). One of the (d + 1) factors D is taken together with (1/2) · F_1, and we obtain the second summand for the representative: T̄^i_{II}(m′_2) · ((1/2) · F_1 · D).

We note that m′_1 = m′_2, and for each product of factors of the same type Φ_l from T′^i_I(m′_1), T̄^i_{II}(m′_2) we take the maximum and build the new product T^i(m′) ≤ T′^i_I(m′_1), T̄^i_{II}(m′_2). Finally, we set

T^i_I(m) := T^i(m′) · G,   (52)

where G = F_1 · (D + D̄)/2. We note that D = D_t(k − v) for t ∈ M_i and v defined by the present step j.

2. If for m one of the representatives for m_1 or m_2 is missing, the procedure for the single representative is performed as in 1., and the result is taken as T^i_I(m), with G defined by only one component.

3. Formally, all possible m are considered, i.e., 0 ≤ w(m_l) ≤ 3 and ∑_l w(m_l) = j, and therefore each of the T^i_I(m_1), T^i_{II}(m_2) is involved in exactly one computation of a representative: If f_0 = 0 and m_1 is missing, then E · D produces F_0, and F_1 can be derived. The factor E is present in m_2 from the first step onwards; cf. (a) and (b) in the derivation of (42).

Type II Representatives:

By the definition of Type II products, the following sums are generated from M_{i−1} and M_{i+1}, respectively, at step j (see (40) to (44)) for t ∈ M_i:

( T^{i±1}_I(m_1) + T^{i±1}_{II}(m_2) ) · E,   (53)

where T^{i±1}_I(m_1) = T^{i±1}_I(m′_1) · G (see (52)) and T^{i±1}_{II}(m_2) = T^{i±1}_{II}(m′_2) · E.

1. We first consider the sums separately for each t̄ ∈ M_{i±1} before applying the max-operations as in (44). For the calculation of the Type II representative, we require m_1 = [d − 1, e, f_0, f_1 − 1, g + 1, h − 1] and m_2 = [d, e, f_0 + 1, f_1 − 1, g, h − 1]. We have (d − 1) + e + 2 · f_0 + 2 · (f_1 − 1) + 3 · (g + 1) + (h − 1) = j − 1 and d + e + 2 · (f_0 + 1) + 2 · (f_1 − 1) + 3 · g + (h − 1) = j − 1, where m = [d, e, f_0, f_1, g, h] for t ∈ M_i at step j. We note at this point already that one factor E vanishes in m_1 as well as in m_2, but the E from (53) is added.

We consider only t̄ ∈ M_{i+1}; the calculation for M_{i−1} is the same. From (53) we have

T^{i+1}_I(m′′_1) · E′ · G + T^{i+1}_{II}(m′′_2) · D′′ · F_0 · E.   (54)

By the definition of the factor G we have

G = F_1 · (D + D′)/2 = F_0 · D + F_0 · D′.   (55)

It is important to note that D in the definition of G and E in (54) are from the same t̄ ∈ M_{i+1}. This follows directly from the calculation of the Type I representative (see (52)) and the definition of Type II products (and representatives). Therefore, we have D + E = 1; see (17) and (24).

In (54), the factors where m1 and m2 are different have been written separately (plusone E-factor), and therefore we can take the maximum of T i+1

I (m ′′1) and T i+1

II (m ′′2) which

is denoted by T i+1II (m ′′). Thus, we can concentrate on E ′ · G + D′′ · F0 · E :

E ′ · G + D′ · F · E = (E ′ · F0 · D + E ′ · F0 · D′) + D′′ · F0 · E (56)

= (E ′ · D + D′′ · E) · F0 + E ′ · F0 · D′. (57)

   For $E' \cdot D + D'' \cdot E$ we can prove:

   $$\text{If } k + 2 \ge n^{\Gamma}, \text{ then } E' \cdot D + D'' \cdot E < D \cdot H, \qquad (58)$$

   where $H := 1 - 1/(n \cdot (k+2)^{2/\Gamma})$. The proof is given in the Appendix. By definition of $H$, we have $E' < H$ for arbitrary values of the factors $E'$. Thus, from (56) to (58) we obtain

   $$E' \cdot G + D'' \cdot F_0 \cdot E \le D \cdot H \cdot F_0 + E' \cdot F_0 \cdot D' \qquad (59)$$
   $$< D \cdot H \cdot F_0 + H \cdot F_0 \cdot D' \le \max\{D, D'\} \cdot H \cdot 2 \cdot F_0 \qquad (60)$$
   $$= D \cdot H \cdot F_1. \qquad (61)$$

   For $D$, we can simply take the value $D_t(k - (v-1))$ for a local minimum $t$. Now, the same procedure is performed for $M_{i-1}$. Let $T^i_{\mathrm{II}}(m'')$ denote the maximum of $T^{i-1}_{\mathrm{II}}(m'')$ and $T^{i+1}_{\mathrm{II}}(m'')$. We finally set

   $$T^i_{\mathrm{II}}(m) := T^i_{\mathrm{II}}(m'') \cdot D \cdot H \cdot F_1 \cdot E \qquad (62)$$

   to be the representative of signature $m$ for factors of Type II. For $m''$ we have $m'' = [d-1, e-1, f_0, f_1-1, g, h-1]$ and therefore $(d-1) + (e-1) + 2\cdot f_0 + 2\cdot(f_1-1) + 3\cdot g + (h-1) = j-5$. The suffix from (62) adds $1 + 1 + 2 + 1 = 5$, which results in $\sum_l w(m_l) = j$ for $m$.


2. For $m$, the two “source representatives” are defined uniquely by $m_1$ and $m_2$. If one of the representatives for $m_1$ or $m_2$ is missing, the procedure for the single representative is performed as if there were two representatives; the calculated factors might only be smaller than $D$ and $H$, and we can introduce $D$ and $H$ again as upper bounds.

3. All possible $m$ are considered, i.e., $0 \le w(m_l) \le 3$ and $\sum_l w(m_l) = j$, and therefore each of the $T^{i\pm1}_{\mathrm{I}}(m_1)$, $T^{i\pm1}_{\mathrm{II}}(m_2)$ is involved in exactly one computation of a representative. We note that the factor $E$ is again present in $m$ as the last factor.
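As for Type I, the Type II signature arithmetic can be verified by a small script. This sketch extends the helper above; the source signatures, the signature $m''$, and the weight-5 suffix $D \cdot H \cdot F_1 \cdot E$ are from (62), while the example values are ours:

```python
WEIGHTS = [1, 1, 2, 2, 3, 1]               # weights of the factors D, E, F0, F1, G, H

def weight(m):
    return sum(w * c for w, c in zip(WEIGHTS, m))

def type_ii_sources(m):
    """Source signatures m1, m2 and the signature m'' from (62)."""
    d, e, f0, f1, g, h = m
    m1 = [d - 1, e, f0, f1 - 1, g + 1, h - 1]
    m2 = [d, e, f0 + 1, f1 - 1, g, h - 1]
    m_pp = [d - 1, e - 1, f0, f1 - 1, g, h - 1]
    return m1, m2, m_pp

m = [2, 2, 1, 2, 1, 2]
m1, m2, m_pp = type_ii_sources(m)
assert weight(m1) == weight(m) - 1 and weight(m2) == weight(m) - 1
# The suffix D * H * F1 * E in (62) contributes weight 1 + 1 + 2 + 1 = 5:
assert weight(m_pp) + 5 == weight(m)
```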

We are now going to analyze how the coefficients change when the representatives are calculated:

Lemma 5. For the coefficients of Type I and Type II representatives, the following upper bounds are valid for all $i$ and $m$:

$$A^i_{\mathrm{I}}(m) \le 1 \quad \text{and} \quad A^i_{\mathrm{II}}(m) \le 1.$$

Proof: The proof is performed by induction. For $j = 1$, the proposition is true; it follows from the definition of the factors $\Lambda_1 = D$ and $\Lambda_2 = E$; see also the example.

For $j > 1$, we consider the calculation of Type I and Type II representatives: In (52) as well as in (62), only a single representative is calculated. Therefore, we have $A^i_{\mathrm{I}}(m) \le \max\{A^i_{\mathrm{I}}(m_1), A^i_{\mathrm{II}}(m_2)\}$ and $A^i_{\mathrm{II}}(m) \le \max\{A^{i\pm1}_{\mathrm{I}}(m_1), A^{i\pm1}_{\mathrm{II}}(m_2)\}$. The coefficients on the right-hand sides are from step $(j-1)$, and therefore the proposition is true for $j > 1$.

Now, we can complete the derivation of an upper bound of $\sum_{t \notin M_0} \sum_{j=k_1+1}^{k_2} |S^2_j(t, k_2)|$; see (37). Since the coefficients associated with formal factors can be bounded from above by 1 according to Lemma 5, it remains to consider upper bounds for the powers of each particular formal factor only. We use $(1 + 1/x)^{x+1} > e$ for $x > 0$, and the upper bounds for powers of formal factors can be summarized in the following way:

$$D^a < \prod_{v=0}^{a-1} e^{-\frac{1}{n\cdot(k+2-v)^{\gamma}}} \le e^{-\frac{a}{n\cdot(k+2)^{\gamma}}}; \qquad (63)$$

$$E^b \le \Bigl(1 - \frac{1}{n}\Bigr)^{b} < e^{-\frac{b}{n}}, \quad \text{which follows from} \quad \frac{r(t)}{n} + \frac{n-1-r(t)}{n\cdot(k+2-v)^{\gamma}} < 1 - \frac{1}{n}; \qquad (64)$$

$$F_0^c = \frac{1}{4^c}\cdot\Bigl(1 - \frac{1}{\alpha^2(n)}\Bigr)^{c} < \frac{1}{4^c}\cdot e^{-\frac{c}{\alpha^2(n)}}, \quad \text{cf. (48)}; \qquad (65)$$

$$F_1^d = \frac{1}{2^d}\cdot\Bigl(1 - \frac{1}{\alpha^2(n)}\Bigr)^{d} < \frac{1}{2^d}\cdot e^{-\frac{d}{\alpha^2(n)}}, \quad \text{cf. (50)}; \qquad (66)$$

$$G^g < F_1^g \cdot e^{-\frac{g}{n\cdot(k+2)^{\gamma}}}, \quad \text{cf. (52)}; \qquad (67)$$

$$H^h = \Bigl(1 - \frac{1}{n\cdot(k+2)^{2/\Gamma}}\Bigr)^{h} < e^{-\frac{h}{n\cdot(k+2)^{2/\Gamma}}}, \quad \text{cf. (58)}. \qquad (68)$$
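As a quick numerical plausibility check of bounds of this form, the following sketch evaluates (63) and (68) for sample values. The concrete numbers $n$, $\Gamma$, and the powers are ours, and we take $\gamma = 1/\Gamma$, which is consistent with the exponents $2/\Gamma$ and $3/\Gamma$ appearing in the Appendix:

```python
import math

# Sample-value check of the exponential bounds (63) and (68).
# n, Gamma, and the powers a, h are illustrative; gamma = 1/Gamma is assumed.
n, Gamma = 64, 6
gamma = 1.0 / Gamma
k = n ** Gamma                      # satisfies k + 2 >= n^Gamma, as in (58)

D = 1.0 - 1.0 / (n * (k + 2) ** gamma)
H = 1.0 - 1.0 / (n * (k + 2) ** (2.0 / Gamma))

a = h = 10
# (63), with D kept at its largest value (v = 0) for simplicity:
assert D ** a < math.exp(-a / (n * (k + 2) ** gamma))
# (68):
assert H ** h < math.exp(-h / (n * (k + 2) ** (2.0 / Gamma)))
```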


The proof of Lemma 5 requires $k + 2 \ge n^{\Gamma}$, see (58), and we have $\alpha(n) = n^{3/2}$, cf. (47). Thus, $G$ is the largest factor, and we take into account that there are fewer than $j^6$ partitions of $j$ of the type $d + e + 2\cdot(f_0 + f_1) + 3\cdot g + h = j$.
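The partition bound can be confirmed by brute force for small $j$; in the following sketch the loop bounds and the tested range of $j$ are our own choices, while the constraint is exactly the signature equation above:

```python
from itertools import product

# Count signatures [d, e, f0, f1, g, h] with d + e + 2*(f0 + f1) + 3*g + h = j
# and confirm that the count stays below j^6 (checked for small j only).
def count_signatures(j):
    count = 0
    for d, e, f0, f1, g in product(range(j + 1), repeat=5):
        h = j - (d + e + 2 * (f0 + f1) + 3 * g)
        if h >= 0:                  # h is then uniquely determined
            count += 1
    return count

for j in range(2, 13):
    assert count_signatures(j) < j ** 6
```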

The factor $H$ determines the condition $k_1 > n^{\Gamma} - 2$. We define a new value for $k_0$ by setting $k_0 := \max\{n^{\Gamma} - 2, k_0\}$, where the previous $k_0$ is from (22). Thus, together with (2) we obtain

$$\sum_{t \notin M_0} \sum_{j=k_1+1}^{k_2} \bigl|S^2_j(t, k_2)\bigr| < \sum_{j=k_1+1}^{k_2} j^{-1/\Gamma} \cdot e^{-\frac{j-k_0}{n\cdot(k_2+2-(k_2-j))^{2/\Gamma}}} \le \sum_{j=k_1+1}^{k_2} k_1^{-1/\Gamma}\, e^{-\frac{j-k_0}{n\cdot(j+2)^{2/\Gamma}}}.$$

Except for some of the first summands, the elements of the sum tend to zero very fast. If the first elements are analyzed carefully, one can derive the following upper bound:

Lemma 6. Given the maximum escape height $\Gamma$ and $k_0 = n^{\Gamma} - 2$, then $k_2 > k_1 > (2\cdot n)^{\Gamma}$ implies

$$\Biggl|\, \sum_{t \notin M_0} \bigl(\nu(t, k_2) - \nu(t, k_1)\bigr) \cdot a_t(0) \Biggr| < k_1^{-1/(2\cdot\Gamma)}.$$

In Lemma 6 we consider $S^+_j$, and (37) has been applied to these values only. But the same holds for $S^-_j$, with an even smaller first factor of the expansion; see Lemma 3. Thus, we can derive in the same way the corresponding upper bound for $(\mu(t', k_1) - \mu(t', k_2))$. Now, we can derive the

Proof of Theorem 2: We employ the following representation:

$$\sum_{t \notin M_0} a_t(k) = \sum_{t \notin M_0} \bigl(a_t(k) - a_t(k_2)\bigr) + \sum_{t \notin M_0} a_t(k_2)$$
$$= \sum_{t \notin M_0} \bigl(\nu(t, k_2) - \nu(t, k)\bigr) \cdot a_t(0) + \sum_{t' \notin M_0} \bigl(\mu(t', k) - \mu(t', k_2)\bigr) \cdot a_{t'}(0) + \sum_{t \notin M_0} a_t(k_2).$$

We utilize Theorem 1, i.e., if the constant in (12) is sufficiently large, the inhomogeneous simulated annealing procedure defined by (6), (8), and (10) tends to the global minimum of $Z$ on $C$. The value $k_2$ from Lemma 6 is larger than, but independent of, $k_1 = k$, i.e., we can take a $k_2 > k$ such that

$$\sum_{t \notin M_0} a_t(k_2) < \frac{\delta}{3}.$$


We obtain the stated inequality if, additionally, both differences $\sum_{t \notin M_0} (\nu(t, k_2) - \nu(t, k))$ and $\sum_{t' \notin M_0} (\mu(t', k) - \mu(t', k_2))$ are smaller than $\delta/3$. From Lemma 6, we obtain the condition

$$k_1^{-1/(2\cdot\Gamma)} < \frac{\delta}{3}.$$

Together with $k_1 = k \ge 2\cdot k_0$ and $\Gamma/(\Gamma - 2) < 2$, we finally arrive at

$$k > (2\cdot n)^{\Gamma} + \Bigl(\frac{3}{\delta}\Bigr)^{2\cdot\Gamma}.$$
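For a feeling of the magnitudes involved, the bound can be evaluated directly. In this sketch $n$ matches the experiments mentioned in the abstract, while $\Gamma$ and $\delta$ are assumed values chosen only for illustration:

```python
# Evaluating k > (2n)^Gamma + (3/delta)^(2*Gamma) for illustrative parameters.
def required_transitions(n: int, Gamma: int, delta: float) -> float:
    return (2 * n) ** Gamma + (3.0 / delta) ** (2 * Gamma)

# n = 64 as in the experiments; Gamma = 6 and delta = 0.5 are assumed here.
print(f"{required_transitions(64, 6, 0.5):.3e}")   # about 4.4e12
```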

The result reflects the observations from computational experiments: For a given positive example $\sigma$ and a set NEG of negative examples, the local search moves relatively fast to short hypotheses $t$ that satisfy $\sigma$ and reject all elements of NEG. The reason for this is the relatively large number of $t' \in C_\ell$, and therefore, at the beginning, the algorithm runs almost like a greedy algorithm. If we consider, e.g., $\ell = O(\log n)$ (see Section 2.1), then for $t$ of length $\ell$ that do not belong to $C_\ell$, at most $\Gamma\,(= \ell)$ literals have to be changed in order to reach an element of $C_\ell$. However, due to the large value of $|C_\ell|$, the number of changes can be expected to be much smaller in “most cases”. Thus, the “real search” starts for elements of length $\approx (\log n + \psi(n))$ for $\psi(n) \ll \log n$. This “search space” consists of $\binom{n}{\log n + \psi(n)} \approx (n/(\log n + \psi(n)))^{\log n + \psi(n)}$ elements, which implies for a complete search the approximate bound $\psi(n) \le \log\log n$ when compared to $(2\cdot n)^{\Gamma}$.

On the other hand, the complete search among all $\binom{n}{\ell}$ candidates for $C_\ell$ implies a much smaller number of tests with $\sigma$ and NEG than our simulated annealing procedure with $(2\cdot n)^{\Gamma}$ transitions, but simulated annealing can be implemented easily, as shown in Section 3.
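The size comparison in the previous paragraph can be reproduced numerically. In this sketch, $\psi(n) = \log\log n$ and $\Gamma = \ell = \log_2 n$ are assumed values used only for illustration:

```python
import math

# Comparing the residual search space C(n, log n + psi(n)) with the
# (2n)^Gamma transition bound; psi(n) = log log n, Gamma = l = log2(n) assumed.
n = 64
ell = int(math.log2(n))                 # l = 6
psi = int(math.log2(ell))               # psi(n) = log log n = 2 (rounded)

search_space = math.comb(n, ell + psi)  # C(64, 8), about 4.4e9
transitions = (2 * n) ** ell            # 128^6,    about 4.4e12

# With psi(n) <= log log n, complete search of the residual space stays
# below the simulated annealing transition bound:
assert search_space < transitions
```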

5. Concluding remarks

We have investigated the problem of whether disjunctive normal forms of polynomial size can be approximated efficiently from positive and negative examples by local search procedures. The input examples are distributed uniformly on $\{0, 1\}^n$. We have shown that there exists a simulated annealing-based procedure which, given a polynomial number of uniformly distributed positive and negative examples of the target disjunction of conjunctions $F$, computes in time $O(n^{\Gamma}) + (3/\delta)^{2\cdot\Gamma}$ a hypothesis $H$ such that, with confidence at least $(1-\delta)$, all examples are classified correctly, where $\Gamma$ is the maximum of the minimal escape heights from local minima of the associated energy landscape. The run-time is derived from a careful analysis of the underlying inhomogeneous Markov process. The algorithm has been implemented and tested for several parameter settings. Although our convergence analysis focuses on the set of presented learning examples only, the computational experiments on randomly chosen DNFs have produced good results on the same number of test examples different from the learning set. Since $\Gamma$ depends on the sample set, the convergence analysis provides a basis for studying the average-case complexity. Future research will be directed towards applications of our local search method to structured data from medical diagnosis. In particular, we intend to apply our approximation algorithm to the diagnosis of diabetes.


This makes it necessary, as described in [8], to extend our binary classification method to multicategory classifications.

Appendix

Proof of (58): For the particular factor $D''$ we utilize the following upper bound, cf. (3):

$$D'' = \frac{s(t)+1}{N_t} - \frac{N_t - r(t) - 1}{N_t} \cdot \frac{1}{(k+2-v)^{\gamma}} < 1 - \frac{1}{n\cdot(k+2-v)^{\gamma}}. \qquad (69)$$

In the expression $E' \cdot D + D'' \cdot E$, we substitute $E'$ by its value and $D''$ by the upper bound (69):

$$E' \cdot D + D'' \cdot E < \Bigl(\frac{r'(t)}{N'} + \frac{s'(t)}{N'} \cdot \frac{1}{(k+2-v')^{\gamma}}\Bigr) \cdot D + \Bigl(1 - \frac{1}{n\cdot(k+2-v'')^{\gamma}}\Bigr) \cdot E.$$

Thus, we have to show

$$\Bigl(\frac{r'(t)}{N'} + \frac{s'(t)}{N'} \cdot \frac{1}{(k+2-v')^{\gamma}}\Bigr) \cdot D + \Bigl(1 - \frac{1}{n\cdot(k+2-v'')^{\gamma}}\Bigr) \cdot E < \Bigl(1 - \frac{1}{n\cdot(k+2)^{\gamma}}\Bigr) \cdot \Bigl(1 - \frac{1}{n\cdot(k+2)^{2/\Gamma}}\Bigr),$$

where $(1 - 1/(n\cdot(k+2)^{\gamma}))$ stands for $D$. We obtain:

$$\frac{r'(t)}{N'} \cdot D + E + \frac{s'(t)}{N'} \cdot \frac{D}{(k+2-v')^{\gamma}} - \frac{E}{n\cdot(k+2-v'')^{\gamma}} < 1 - \frac{1}{n\cdot(k+2)^{\gamma}} - \frac{1}{n\cdot(k+2)^{2/\Gamma}} + \frac{1}{n^2\cdot(k+2)^{3/\Gamma}}.$$

Since $D + E = 1$, the condition turns into

$$0 < \Bigl(\frac{s'(t)+1}{N'} - \frac{s'(t)}{N'} \cdot \frac{1}{(k+2-v')^{\gamma}}\Bigr) \cdot D + \frac{1}{n\cdot(k+2-v'')^{\gamma}} \cdot E + \frac{1}{n^2\cdot(k+2)^{3/\Gamma}} - \frac{1}{n\cdot(k+2)^{\gamma}} - \frac{1}{n\cdot(k+2)^{2/\Gamma}}.$$

From $N' \le n$ we obtain that the first factor on the right-hand side is $\ge 1/n$. The same applies to $D$, and therefore, by our assumption $k + 2 \ge n^{\Gamma}$, the value $1/(n\cdot(k+2)^{\gamma})$ is compensated. Furthermore, we have $E \ge \min\{1/n,\, (N-1)/(N\cdot(k+2)^{\gamma})\}$, $2 \le N \le n$, and $v'' \ge 0$. Thus, the second summand on the right-hand side is compensated as well, where we take into account that $E = (N-1)/(N\cdot(k+2)^{\gamma})$ implies that $D$ corresponds to a local minimum and is therefore equal to $1 - (N-1)/(N\cdot(k+2)^{\gamma})$.
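The compensation argument can be sanity-checked numerically over a range of admissible parameters. In this sketch, the values of $n$ and $\Gamma$ and the enumerated ranges for $r'$, $s'$, $N'$, and the time indices are our own choices, with $r' + s' + 1 = N'$ and $\gamma = 1/\Gamma$ assumed:

```python
# Numerical sanity check of (58): E' * D + D'' * E < D * H.
# Assumed: n, Gamma, gamma = 1/Gamma, and r' + s' + 1 = N' for the neighborhood.
n, Gamma = 64, 6
gamma = 1.0 / Gamma
k = n ** Gamma - 2                      # k + 2 = n^Gamma, the boundary case

D = 1.0 - 1.0 / (n * (k + 2) ** gamma)
E = 1.0 - D                             # D + E = 1
H = 1.0 - 1.0 / (n * (k + 2) ** (2.0 / Gamma))

for N in range(2, n + 1):               # 2 <= N' <= n
    for r in range(N):                  # enumerate r', s' with r' + s' + 1 = N'
        s = N - r - 1
        for v in (0, 1, 5):             # a few illustrative time offsets
            E_prime = r / N + (s / N) / (k + 2 - v) ** gamma
            D_pp = 1.0 - 1.0 / (n * (k + 2 - v) ** gamma)   # bound (69)
            assert E_prime * D + D_pp * E < D * H
```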


Acknowledgment

The authors would like to thank the referees for their careful reading of the manuscript and helpful suggestions that resulted in an improved presentation.

Research partially supported by the Strategic Research Program at The Chinese University of Hong Kong under Grant No. SRP 9505 and a Hong Kong Government RGC Earmarked Grant, Ref. No. CUHK 4010/98E.

References

1. E.H.L. Aarts and J.H.M. Korst, Simulated Annealing and Boltzmann Machines: A Stochastic Approach, Wiley & Sons: New York, 1989.
2. E.H.L. Aarts, Local Search in Combinatorial Optimization, Wiley & Sons: New York, 1998.
3. H. Aizenstein and L. Pitt, "On the learnability of disjunctive normal form formulas," Machine Learning, vol. 19, pp. 183–208, 1995.
4. A. Albrecht, S.K. Cheung, K.S. Leung, and C.K. Wong, "Stochastic simulations of two-dimensional composite packings," J. of Computational Physics, vol. 136, pp. 559–579, 1997.
5. D. Angluin, "Queries and concept learning," Machine Learning, vol. 2, pp. 319–342, 1988.
6. A. Bachem, W. Hochstattler, B. Steckemetz, and A. Volmer, "Computational experience with general equilibrium problems," Computational Optimization and Applications, vol. 6, pp. 213–225, 1996.
7. E.J. Bredensteiner and K.P. Bennett, "Feature minimization within decision trees," Computational Optimization and Applications, vol. 10, pp. 111–126, 1998.
8. E.J. Bredensteiner and K.P. Bennett, "Multicategory classification by support vector machines," Computational Optimization and Applications, vol. 12, pp. 53–79, 1999.
9. O. Catoni, "Rough large deviation estimates for simulated annealing: Applications to exponential schedules," The Annals of Probability, vol. 20, pp. 1109–1146, 1992.
10. O. Catoni, "Metropolis, simulated annealing, and iterated energy transformation algorithms: Theory and experiments," J. of Complexity, vol. 12, pp. 595–623, 1996.
11. V. Cerny, "A thermodynamical approach to the travelling salesman problem: An efficient simulation algorithm," J. Optim. Theory Appl., vol. 45, pp. 41–51, 1985.
12. P. Clark and T. Niblett, "The CN2 induction algorithm," Machine Learning, vol. 3, pp. 261–283, 1989.
13. S.D. Flam, "Learning equilibrium play: A myopic approach," Computational Optimization and Applications, vol. 14, pp. 87–102, 1999.
14. B. Hajek, "Cooling schedules for optimal annealing," Mathem. Oper. Res., vol. 13, pp. 311–329, 1988.
15. J. Jackson, "An efficient membership-query algorithm for learning DNF with respect to the uniform distribution," in Proc. of the 35th Annual Symposium on Foundations of Computer Science, 1994, pp. 42–53.
16. M. Kearns, M. Li, L. Pitt, and L.G. Valiant, "Recent results on Boolean concept learning," in Proc. of the 4th Int. Workshop on Machine Learning, 1987, pp. 337–352.
17. S. Kirkpatrick, C.D. Gelatt, Jr., and M.P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, pp. 671–680, 1983.
18. M.H. Lim, Y. Yuan, and S. Omatu, "Efficient genetic algorithms using simple genes exchange local search policy for the quadratic assignment problem," Computational Optimization and Applications, vol. 15, pp. 249–268, 2000.
19. Y. Mansour, "An n^{O(log log n)} learning algorithm for DNF under the uniform distribution," J. of Computer and Systems Sciences, vol. 50, pp. 543–550, 1995.
20. R.J. Mooney, "Encouraging experimental results on learning CNF," Machine Learning, vol. 19, pp. 79–92, 1995.
21. G. Righini, "Annealing algorithms for multisource absolute location problems on graphs," Computational Optimization and Applications, vol. 7, pp. 325–337, 1997.
22. F. Romeo and A. Sangiovanni-Vincentelli, "A theoretical framework for simulated annealing," Algorithmica, vol. 6, pp. 302–345, 1991.
23. H. Shvaytser, "Learnable and nonlearnable visual concepts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 459–466, 1990.
24. L.G. Valiant, "A theory of the learnable," Comm. ACM, vol. 27, pp. 1134–1142, 1984.
25. K. Verbeurgt, "Learning DNF under the uniform distribution in quasi-polynomial time," in Proc. of the 3rd Annual Workshop on Computational Learning Theory, 1990, pp. 314–326.
26. D.F. Wong and C.L. Liu, "Floorplan design of VLSI circuits," Algorithmica, vol. 4, pp. 263–291, 1989.