To the memory of Andrei Muchnik

Preface

The notion of algorithmic complexity (also sometimes called “algorithmic entropy”) appeared in the 1960s in between the theory of computation, probability theory and information theory.

The idea of A.N. Kolmogorov was to measure the amount of information in finite objects (and not in random variables, as it is done in classical Shannon information theory). His famous paper [77], published in 1965, explains how this can be done (up to a bounded additive term) using the algorithmic approach.

Similar ideas were suggested a few years earlier by R. Solomonoff (see [185] and his other papers; a historical account and references can be found in [102]).¹

The motivation of Solomonoff was quite different. He tried to define the notion of “a priori probability”. Imagine there is some experiment (random process) and we know nothing about its internal structure. Can we say something about the probabilities of different outcomes in this situation? One can relate this to complexity measures by saying that simple objects have greater a priori probability than complex ones. (Unfortunately, Solomonoff’s work became popular only after Kolmogorov mentioned it in his paper.)

In 1965 G. Chaitin (then an 18-year-old undergraduate student) submitted two papers [28] and [29]; they were published in 1966 and 1969 respectively. In the second paper he proposed the same definition of algorithmic complexity as Kolmogorov.

The basic properties of Kolmogorov complexity were established in the 1970s. Working independently, C.P. Schnorr and L. Levin (who was a student of Kolmogorov) found a link between complexity and the notion of algorithmic randomness (introduced in 1966 by P. Martin-Löf [114]). To achieve this, they introduced a slightly different version of complexity, the so-called monotone complexity. Also, Solomonoff’s ideas about a priori probability were formalized in the form of prefix complexity, introduced by Levin and later by Chaitin. The notions of complexity turned out to be useful both for the theory of computation and for probability theory.

Kolmogorov complexity became popular (and for a good reason: it is a basic and philosophically important notion of the theory of algorithms) after M. Li and P. Vitányi published a book on the subject [102] (the first edition appeared in 1993). Almost everything about Kolmogorov complexity that was known at the moment was covered in that book or at least mentioned as an exercise. The book also provided a detailed historical account, references to first publications, etc. Then the books of C. Calude [25] and A. Nies [146] appeared, as well as the book of R. Downey and D. Hirschfeldt [49]. These books cover many interesting results obtained recently (in particular, the results that relate complexity and randomness with classical recursion theory).

¹Kolmogorov wrote in [78]: “I came to a similar notion not knowing about Solomonoff’s work”.

Our book does not try to be comprehensive (in particular, we do not say much about the recent results mentioned above). Instead, we tried to select the most important topics and results (both from the technical and the philosophical viewpoint) and explain them clearly. We do not say much about the history of the topic: as is usually done in textbooks, we formulate most statements without references, and this does not mean (of course) any authorship claim.

We start the book with a section “What is this book about?” where we try to overview briefly the main ideas and topics related to Kolmogorov complexity and algorithmic randomness, so the reader can browse this section to decide whether the book is worth reading.

As an appendix we reproduce the English translation of a small brochure written by one of the authors (V.U.), based on his talk for high school students and undergraduates (July 23, 2005) delivered during the “Modern Mathematics” Summer School (Dubna near Moscow); the brochure was published in 2006 by the MCCME publishing house (Moscow). The lecture was devoted to different notions of algorithmic randomness, and the reader who has no time or incentive to study the corresponding chapters of the book in detail can still get some acquaintance with this topic.

Unfortunately, the notation and terminology related to Kolmogorov complexity are not very logical (and different people often use different notation). Even the same authors used different notation in different papers. For example, Kolmogorov used both the letters 𝐾 and 𝐻 in his two basic publications [77, 78]. In [77] he used the term “complexity” and denoted the complexity of a string 𝑥 by 𝐾(𝑥). Later he used for the same notion the term “entropy”, used in Shannon information theory (and earlier in physics). Shannon information theory is based on probability theory; Kolmogorov had an ambitious plan to construct a parallel theory that does not depend on the notion of probability. In [78] Kolmogorov wrote, using the same word “entropy” in this new sense:

The ordinary definition of entropy uses probability concepts, and thus does not pertain to individual values, but to random values, i.e., to probability distributions within a group of values. ⟨. . .⟩ By far, not all applications of information theory fit rationally into such an interpretation of its basic concepts. I believe that the need for attaching definite meanings to the expressions 𝐻(𝑥|𝑦) and 𝐼(𝑥|𝑦), in the case of individual values 𝑥 and 𝑦 that are not viewed as a result of random tests with a definite law of distribution, was realized long ago by many who dealt with information theory.

As far as I know, the first paper published on the idea of revising information theory so as to satisfy the above conditions was the article of Solomonoff [185]. I came to similar conclusions, before becoming aware of Solomonoff’s work in 1963–1964, and published my first article on the subject [77] in early 1965. ⟨. . .⟩


The meaning of the new definition is very simple. Entropy 𝐻(𝑥|𝑦) is the minimal [bit] length of a ⟨. . .⟩ program 𝑃 that permits construction of the value of 𝑥, the value of 𝑦 being known,

𝐻(𝑥 |𝑦) = min{ 𝑙(𝑃) | 𝐴(𝑃, 𝑦) = 𝑥 }.

This concept is supported by the general theory of “computable” (partially recursive) functions, i.e., by the theory of algorithms in general.

⟨. . .⟩ The preceding rather superficial discourse should prove two general theses.

1) Basic information theory concepts must and can be founded without recourse to the probability theory, and in such a manner that “entropy” and “mutual information” concepts are applicable to individual values.

2) Thus introduced, information theory concepts can form the basis of the term random, which naturally suggests that randomness is the absence of regularities.²

And earlier (April 23, 1965), giving a talk “The notion of information and the foundations of the probability theory” at the Institute of Philosophy of the USSR Academy of Sciences, Kolmogorov said:

So the two problems arise sequentially:
1. Is it possible to free the information theory (and the notion of the “amount of information”) from probabilities?
2. Is it possible to develop the intuitive idea of randomness as incompressibility (the law describing the object cannot be shortened)?

(the transcript of his talk was published in [84] on p. 126).
So Kolmogorov uses the term “entropy” for the same notion that was named “complexity” in his first paper, and denotes it by the letter 𝐻 instead of 𝐾. Later the same notion was denoted by 𝐶 (see, e.g., [102]) while the letter 𝐾 is used for prefix complexity (denoted by KP(𝑥) in Levin’s papers where prefix complexity was introduced).

Unfortunately, the attempts to unify the terminology and notation made by different people (including the authors) have led mostly to increasing confusion. In the English version of this book we follow the terminology that is most used nowadays, with a few exceptions, and mention the other notations used. For the reader’s convenience, a list of the notation used and an index are provided.

²The published English version of this paper says “random is the absence of periodicity”, but this evidently is a translation error, and we correct the text following the Russian version.


* * *
In the beginning of the 1980s Kolmogorov (with the assistance of A. Semenov) initiated a seminar at the Mathematics and Mechanics Department of Moscow State (Lomonosov) University called “Description and computation complexity”; now the seminar (still active) is known as “Kolmogorov seminar”. The authors are deeply grateful to their colleagues working in this seminar, including A. Zvonkin, E. Asarin, V. Vovk (they were Kolmogorov’s students), S. Soprunov, V. Vyugin, A. Romashchenko, M. Vyalyi, S. Tarasov, A. Chernov, M. Vyugin, S. Positselsky, K. Makarychev, Yu. Makarychev, M. Ushakov, M. Ustinov, S. Salnikov, A. Rumyantsev, D. Musatov, V. Podolskii, I. Mezhirov, Yu. Pritykin, M. Raskin, A. Khodyrev, P. Karpovich, A. Minasyan, E. Kalinina, G. Chelnokov, I. Razenshteyn, M. Andreev, A. Savin, M. Dektyarev, A. Savchik, A. Kumok, V. Arzumanyan, A. Makhlin, G. Novikov, A. Milovanov; the book would not be possible without them.

The frog drawing for the cover was made by Marina Feigelman; the cover itself was designed by Olga Lehtonen. As usual, we are grateful (in particular, for the help in the preparation of a camera-ready copy for the Russian edition) to Victor Shuvalov.

The authors were supported by the International Science Foundation (Soros foundation), STINT (Sweden), the Russian Fund for Basic Research (grants 01-01-00493-a, 01-01-01028-a, 06-01-00122-a, 09-01-00709-a, 12-01-00864-a), CNRS and ANR (France, ANR-08-EMER-008 NAFIT and ANR-15-CE40-0016-01 RaCAF grants).

The book was made possible by the generous support of our colleagues, including Bruno Bauwens, Laurent Bienvenu, Harry Buhrman, Cris Calude, Bruno Durand, Péter Gács, Denis Hirschfeldt, Rupert Hölzl, Mathieu Hoyrup, Michal Koucký, Leonid Levin, Wolfgang Merkle, Joseph Miller, Andre Nies, Christopher Porter, Jan Reimann, Jason Rute, Michael Sipser, Steven Simpson, Paul Vitányi, Sergey Vorobyov, and many others.

We are thankful to the American Mathematical Society (in particular, Sergey Gelfand) for the suggestion to submit the book for publication in their book program and for the kind permission to keep the book freely available in electronic form at our home pages. We thank the (anonymous) referees for their attention and suggestions, and the language editors for correcting our English errors.

For many years the authors had the privilege to work in close professional and personal contact with Andrej Muchnik (1958–2007), an outstanding mathematician, a deep thinker and an admirable person, who participated in the work of the Kolmogorov seminar and inspired a lot of the work done in this seminar. We devote this book to his memory.

A. Shen, V. Uspensky, N. Vereshchagin
September 1, 2016

Contents

    Preface 3

What is this book about? 11
What is Kolmogorov complexity? 11
Optimal description modes 12
Kolmogorov complexity 14
Complexity and information 15
Complexity and randomness 18
Non-computability of 𝐶 and Berry’s paradox 19
Some applications of Kolmogorov complexity 20

    Basic notions and notations 25

Chapter 1. Plain Kolmogorov complexity 29
1.1. The definition and main properties 29
1.2. Algorithmic properties 35

Chapter 2. Complexity of pairs and conditional complexity 45
2.1. Complexity of pairs 45
2.2. Conditional complexity 48
2.3. Complexity as the amount of information 58

Chapter 3. Martin-Löf randomness 67
3.1. Measures on Ω 67
3.2. The Strong Law of Large Numbers 69
3.3. Effectively null sets 72
3.4. Properties of Martin-Löf randomness 79
3.5. Randomness deficiencies 84

Chapter 4. A priori probability and prefix complexity 89
4.1. Randomized algorithms and semimeasures on N 89
4.2. Maximal semimeasures 93
4.3. Prefix machines 96
4.4. A digression: machines with self-delimiting input 99
4.5. The main theorem on prefix complexity 105
4.6. Properties of prefix complexity 110
4.7. Conditional prefix complexity and complexity of pairs 116

Chapter 5. Monotone complexity 129
5.1. Probabilistic machines and semimeasures on the tree 129
5.2. Maximal semimeasure on the binary tree 135
5.3. A priori complexity and its properties 136
5.4. Computable mappings of type Σ → Σ 140
5.5. Monotone complexity 143
5.6. Levin–Schnorr theorem 158
5.7. The random number Ω 170
5.8. Effective Hausdorff dimension 185
5.9. Randomness with respect to different measures 189

Chapter 6. General scheme for complexities 205
6.1. Decision complexity 205
6.2. Comparing complexities 209
6.3. Conditional complexities 212
6.4. Complexities and oracles 214

Chapter 7. Shannon entropy and Kolmogorov complexity 225
7.1. Shannon entropy 225
7.2. Pairs and conditional entropy 229
7.3. Complexity and entropy 237

Chapter 8. Some applications 245
8.1. There are infinitely many primes 245
8.2. Moving information along the tape 245
8.3. Finite automata with several heads 248
8.4. Laws of Large Numbers 250
8.5. Forbidden substrings 253
8.6. A proof of an inequality 267
8.7. Lipschitz transformations are not transitive 270

Chapter 9. Frequency and game approaches to randomness 273
9.1. The original idea of von Mises 273
9.2. Set of strings as selection rules 274
9.3. Mises–Church randomness 276
9.4. Ville’s example 279
9.5. Martingales 282
9.6. A digression: martingales in probability theory 287
9.7. Lower semicomputable martingales 289
9.8. Computable martingales 291
9.9. Martingales and Schnorr randomness 294
9.10. Martingales and effective dimension 296
9.11. Partial selection rules 299
9.12. Non-monotonic selection rules 302
9.13. Change in the measure and randomness 308

Chapter 10. Inequalities for entropy, complexity and size 323
10.1. Introduction and summary 323
10.2. Uniform sets 328
10.3. A construction of a uniform set 331
10.4. Uniform sets and orbits 333
10.5. Almost uniform sets 334
10.6. Typization trick 335
10.7. Combinatorial interpretation: examples 338
10.8. Combinatorial interpretation: the general case 340
10.9. One more combinatorial interpretation 342
10.10. The inequalities for two and three strings 345
10.11. Dimensions and Ingleton’s inequality 347
10.12. Conditionally independent random variables 352
10.13. Non-Shannon inequalities 353

Chapter 11. Common information 359
11.1. Incompressible representations of strings 359
11.2. Representing mutual information as a string 360
11.3. The combinatorial meaning of common information 365
11.4. Conditional independence and common information 370

Chapter 12. Multisource algorithmic information theory 375
12.1. Information transmission requests 375
12.2. Conditional encoding 376
12.3. Conditional codes: Muchnik’s theorem 377
12.4. Combinatorial interpretation of Muchnik’s theorem 381
12.5. A digression: on-line matching 383
12.6. Information distance and simultaneous encoding 385
12.7. Conditional codes for two conditions 387
12.8. Information flow and network cuts 391
12.9. Networks with one source 392
12.10. Common information as an information request 396
12.11. Simplifying a program 397
12.12. Minimal sufficient statistics 397

Chapter 13. Information and logic 409
13.1. Problems, operations, complexity 409
13.2. Problem complexity and intuitionistic logic 411
13.3. Some formulas and their complexity 413
13.4. More examples and the proof of Theorem 238 416
13.5. Proof of a result similar to Theorem 238 using Kripke models 421
13.6. A problem whose complexity is not expressible in terms of the complexities of tuples 425

Chapter 14. Algorithmic statistics 433
14.1. The framework and randomness deficiency 433
14.2. Stochastic objects 436
14.3. Two-part descriptions 439
14.4. Hypotheses of restricted type 446
14.5. Optimality and randomness deficiency 455
14.6. Minimal hypotheses 458
14.7. A bit of philosophy 460

Appendix 1. Complexity and foundations of probability 463
Probability theory paradox 463
Current best practice 463
Simple events and events specified in advance 464
Frequency approach 466
Dynamical and statistical laws 467
Are “real-life” sequences complex? 467
Randomness as ignorance: Blum–Micali–Yao pseudorandomness 468
A digression: thermodynamics 469
Another digression: quantum mechanics 471

Appendix 2. Four algorithmic faces of randomness 473
Introduction 473
Face One: Frequency stability and stochasticness 476
Face Two: Chaoticness 478
Face Three: Typicalness 483
Face Four: Unpredictability 484
Generalization for arbitrary computable distributions 488
History and bibliography 494

    Bibliography 499

    Index 511

    Glossary 517

What is this book about?

    What is Kolmogorov complexity?

Roughly speaking, Kolmogorov complexity means “compressed size”. Programs like zip, gzip, bzip2, compress, rar, arj, etc., compress a file (text, image, or some other data) into a presumably shorter one. The original file can then be restored by a “decompressing” program (sometimes both compression and decompression are performed by the same program). Note that we consider here only lossless compression.

A file that has a regular structure can be compressed significantly. Its compressed size is small compared to its length. On the other hand, a file without regularities can hardly be compressed, and its compressed size is close to its original size.

This explanation is very informal and contains several inaccuracies, both technical and more essential. First, instead of files (sequences of bytes) we will consider binary strings (finite sequences of bits, that is, of zeros and ones). The length of such a string is the number of symbols in it. (For example, the string 1001 has length 4, and the empty string has length 0.)

    Here are the more essential points:

∙ We consider only decompressing programs; we do not worry at all about compression. More specifically, a decompressor is any algorithm (a program) that receives a binary string as an input and returns a binary string as an output. If a decompressor 𝐷 on input 𝑥 terminates and returns string 𝑦, we write 𝐷(𝑥) = 𝑦 and say that 𝑥 is a description of 𝑦 with respect to 𝐷. Decompressors are also called description modes.

∙ A description mode is not required to be total. For some 𝑥, the computation 𝐷(𝑥) may never terminate and therefore produces no result. Also we do not put any constraints on the computation time of 𝐷: on some inputs the program 𝐷 may halt only after an extremely long time.

Using recursion theory terminology, we say that a description mode is a partial computable (=partial recursive) function from Ξ to Ξ, where Ξ = {0, 1}* stands for the set of all binary strings. Let us recall that we associate with every algorithm 𝐷 (whose inputs and outputs are binary strings) a function 𝑑 computed by 𝐷; namely, 𝑑(𝑥) is defined for a string 𝑥 if and only if 𝐷 halts on 𝑥, and 𝑑(𝑥) is the output of 𝐷 on 𝑥. A partial function from Ξ to Ξ is called computable if it is associated with (=computed by) some algorithm 𝐷. Usually we use the same letter to denote the algorithm and the function it computes. So we write 𝐷(𝑥) instead of 𝑑(𝑥) unless it causes confusion.

Assume that a description mode (a decompressor) 𝐷 is fixed. (Recall that 𝐷 is computable according to our definitions.) For a string 𝑥 consider all its descriptions, that is, all 𝑦 such that 𝐷(𝑦) is defined and equals 𝑥. The length of the shortest string 𝑦 among them is called the Kolmogorov complexity of 𝑥 with respect to 𝐷:

𝐶𝐷(𝑥) = min{ 𝑙(𝑦) | 𝐷(𝑦) = 𝑥}.
Here 𝑙(𝑦) denotes the length of the string 𝑦; we use this notation throughout the book. The subscript 𝐷 indicates that the definition depends on the choice of the description mode 𝐷. The minimum of the empty set is defined as +∞, thus 𝐶𝐷(𝑥) is infinite for all the strings 𝑥 outside the range of the function 𝐷 (they have no descriptions).

At first glance this definition seems to be meaningless, as for different 𝐷 we obtain quite different notions, including ridiculous ones. For instance, if 𝐷 is nowhere defined, then 𝐶𝐷 is infinite everywhere. If 𝐷(𝑦) = Λ (the empty string) for all 𝑦, then the complexity of the empty string is 0 (since 𝐷(Λ) = Λ and 𝑙(Λ) = 0), and the complexity of all the other strings is infinite.

A more reasonable example: consider a decompressor 𝐷 that just copies its input to output, that is, 𝐷(𝑥) = 𝑥 for all 𝑥. In this case every string is its own description and 𝐶𝐷(𝑥) = 𝑙(𝑥).

Of course, for any given string 𝑥 we can find a description mode 𝐷 that is tailored to 𝑥 and with respect to which 𝑥 has small complexity. Indeed, let 𝐷(Λ) = 𝑥. This implies 𝐶𝐷(𝑥) = 0.

More generally, if we have some class of strings, we may look for a description mode that favors all the strings in this class. For example, for the class of strings consisting of zeros only we may consider the following decompressor:

    𝐷(bin(𝑛)) = 000 . . . 000 (𝑛 zeros),

where bin(𝑛) stands for the binary notation of the natural number 𝑛. The length of the string bin(𝑛) is about log₂ 𝑛 (it does not exceed log₂ 𝑛 + 1). With respect to this description mode, the complexity of the string consisting of 𝑛 zeros is close to log₂ 𝑛. This is much less than the length 𝑛 of the string. On the other hand, all strings containing the symbol 1 have infinite complexity 𝐶𝐷.
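
This toy decompressor is easy to make concrete. The following sketch (ours, not from the book; the function names are hypothetical) implements it and finds 𝐶𝐷(𝑥) by brute force, which is feasible only because this particular decompressor is total and fast:

```python
from itertools import product

def d_zeros(desc):
    """Toy decompressor: the description is bin(n); the output is the string of n zeros."""
    return "0" * int(desc, 2)

def complexity_wrt(decompressor, x, max_len=20):
    """Brute-force C_D(x): the length of a shortest y with D(y) = x, trying all
    binary strings y of length 0, 1, 2, ... up to max_len.  Only sensible for
    total, fast decompressors; returns None if no description of length at most
    max_len exists (for d_zeros this means C_D(x) is infinite)."""
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            y = "".join(bits)
            try:
                if decompressor(y) == x:
                    return length
            except ValueError:
                pass  # treat invalid inputs as "D(y) is undefined"
    return None

print(complexity_wrt(d_zeros, "0" * 1000))        # 10, roughly log2(1000)
print(complexity_wrt(d_zeros, "010", max_len=8))  # None: the string contains a 1
```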

It may seem that the dependence of complexity on the choice of the decompressor makes any general theory of complexity impossible. However, this is not the case.

    Optimal description modes

A description mode is better when descriptions are shorter. According to this, we say that a description mode (decompressor) 𝐷1 is not worse than a description mode 𝐷2 if

𝐶𝐷1(𝑥) ⩽ 𝐶𝐷2(𝑥) + 𝑐

for some constant 𝑐 and for all strings 𝑥.
Let us comment on the role of the constant 𝑐 in this definition. We consider a change in the complexity bounded by a constant as “negligible”. One could say that such a tolerance makes the complexity notion practically useless, as the constant 𝑐 can be very large. However, nobody has managed to get any reasonable theory that overcomes this difficulty and defines complexity with better precision.

Example. Consider two description modes (decompressors) 𝐷1 and 𝐷2. Let us show that there exists a description mode 𝐷 which is not worse than both of them. Indeed, let

    𝐷(0𝑦) = 𝐷1(𝑦),

    𝐷(1𝑦) = 𝐷2(𝑦).

In other words, we consider the first bit of a description as the index of a description mode and the rest as the description (for this mode).

If 𝑦 is a description of 𝑥 with respect to 𝐷1 (or 𝐷2), then 0𝑦 (respectively, 1𝑦) is a description of 𝑥 with respect to 𝐷 as well. This description is only one bit longer, therefore we have

𝐶𝐷(𝑥) ⩽ 𝐶𝐷1(𝑥) + 1,
𝐶𝐷(𝑥) ⩽ 𝐶𝐷2(𝑥) + 1

for all 𝑥. Thus the mode 𝐷 is not worse than both 𝐷1 and 𝐷2.
This idea is often used in practice. For instance, a zip-archive has a preamble; the preamble says (among other things) which mode was used to compress this particular file, and the compressed file follows the preamble.

If we want to use 𝑁 different compression modes, we need to reserve the initial log₂ 𝑁 bits for the index of the compression mode.
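
As a small illustration of this prefixing trick, here is a sketch (ours; assuming the description modes are given as Python functions, with hypothetical names) that combines two, or 𝑁, decompressors into one:

```python
def combine_two(d1, d2):
    """A decompressor D with D(0y) = d1(y) and D(1y) = d2(y); hence
    C_D(x) <= min(C_d1(x), C_d2(x)) + 1."""
    def d(desc):
        if desc == "":
            raise ValueError("undefined on the empty input")
        return d1(desc[1:]) if desc[0] == "0" else d2(desc[1:])
    return d

def combine_many(modes):
    """The same idea for N modes: the first ceil(log2 N) bits select the mode."""
    k = max(1, (len(modes) - 1).bit_length())
    def d(desc):
        if len(desc) < k:
            raise ValueError("too short to contain a mode index")
        return modes[int(desc[:k], 2)](desc[k:])
    return d

d = combine_two(lambda y: y, lambda y: y[::-1])   # identity mode and reversal mode
print(d("0" + "1101"), d("1" + "1101"))           # 1101 1011
```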

    Using a generalization of this idea, we can prove the following theorem:

Theorem 1 (Solomonoff–Kolmogorov). There is a description mode 𝐷 that is not worse than any other one: for every description mode 𝐷′ there is a constant 𝑐 such that

𝐶𝐷(𝑥) ⩽ 𝐶𝐷′(𝑥) + 𝑐

    for every string 𝑥.

    A description mode 𝐷 having this property is called optimal.

Proof. Recall that a description mode by definition is a computable function. Every computable function has a program. We assume that programs are binary strings. Moreover, we assume that reading the program bits from left to right we can determine uniquely where it ends, that is, programs are “self-delimiting”. Note that every programming language can be modified in such a way that programs are self-delimiting. For instance, we can double every bit of a given program (changing 0 to 00 and 1 to 11) and append the pattern 01 to its end.

    Define now a new description mode 𝐷 as follows:

    𝐷(𝑃𝑦) = 𝑃 (𝑦)

where 𝑃 is a program (in the chosen self-delimiting programming language) and 𝑦 is any binary string. That is, the algorithm 𝐷 scans the input string from left to right and extracts a program 𝑃 from the input. (If the input does not start with a valid program, 𝐷 does whatever it wants, say, goes into an infinite loop. The self-delimiting property guarantees that the decomposition of the input is unique: if 𝑃𝑦 = 𝑃′𝑦′ for two programs 𝑃 and 𝑃′, then one of the programs is a prefix of the other one.) Then 𝐷 applies the extracted program 𝑃 to the rest of the input (𝑦) and returns the obtained result. (So 𝐷 is just a “universal algorithm”, or “interpreter”; the only difference is that the program and its input are not separated, and therefore we need to use a self-delimiting programming language.)

Let us show that indeed 𝐷 is not worse than any other description mode 𝑃. We assume that the program 𝑃 is written in the chosen self-delimiting programming language. If 𝑦 is a shortest description of the string 𝑥 with respect to 𝑃, then 𝑃𝑦 is a description of 𝑥 with respect to 𝐷 (though not necessarily a shortest one). Therefore, compared to 𝑃, the shortest description is at most 𝑙(𝑃) bits longer, and

𝐶𝐷(𝑥) ⩽ 𝐶𝑃(𝑥) + 𝑙(𝑃).

The constant 𝑙(𝑃) depends only on the description mode 𝑃 (and not on 𝑥). □

Basically, we used the same trick as in the preceding example, but instead of merging two description modes we join all of them. Each description mode is prefixed by its index (program, identifier). The same idea is used in practice. A “self-extracting archive” is an executable file starting with a small program (a decompressor); the rest is considered as an input to that program. This program is loaded into the memory and then it decompresses the rest of the file.

Note that in our construction the optimal decompressor works for a very long time on some inputs (as some programs have a large running time), and is undefined on some other inputs.

    Kolmogorov complexity

Fix an optimal description mode 𝐷 and call 𝐶𝐷(𝑥) the Kolmogorov complexity of the string 𝑥. In the notation 𝐶𝐷(𝑥) we drop the subscript 𝐷 and write just 𝐶(𝑥).

If we switch to another optimal description mode, the change in complexity is bounded by an additive constant: for every two optimal description modes 𝐷1 and 𝐷2 there is a constant 𝑐(𝐷1, 𝐷2) such that

|𝐶𝐷1(𝑥) − 𝐶𝐷2(𝑥)| ⩽ 𝑐(𝐷1, 𝐷2)
for all 𝑥. Sometimes this inequality is written as follows:

    𝐶𝐷1(𝑥) = 𝐶𝐷2(𝑥) + 𝑂(1),

where 𝑂(1) stands for a bounded function of 𝑥.
Could we then consider the Kolmogorov complexity of a particular string 𝑥 without having in mind a specific optimal description mode used in the definition of 𝐶(𝑥)? No, since by adjusting the optimal description mode we can make the complexity of 𝑥 arbitrarily small or arbitrarily large. Similarly, the relation “string 𝑥 is simpler than 𝑦”, that is, 𝐶(𝑥) < 𝐶(𝑦), has no meaning for two fixed strings 𝑥 and 𝑦: by adjusting the optimal description mode we can make any of these two strings simpler than the other one.

One may wonder then whether Kolmogorov complexity has any sense at all. Trying to defend this notion, let us recall the construction of the optimal description mode used in the proof of the Solomonoff–Kolmogorov theorem. This construction uses some programming language, and two different choices of this language lead to two complexities that differ at most by a constant. This constant is in fact the length of the program that is written in one of these two languages and interprets the other one. If both languages are “natural”, we can expect this constant to be not that huge, just several thousands or even several hundreds. Therefore if we speak about strings whose complexity is, say, about 10⁵ (i.e., the text of a long and not very compressible novel), or 10⁶ (which is reasonable for DNA strings, unless they are much more compressible than biologists now think), then the choice of the programming language is not that important.

Nevertheless one should always have in mind that all statements about Kolmogorov complexity are inherently asymptotic: they involve infinite sequences of strings. This situation is typical also for computational complexity: usually upper and lower bounds for the complexity of some computational problem are asymptotic bounds.

    Complexity and information

One can consider the Kolmogorov complexity of 𝑥 as the amount of information in 𝑥. Indeed, a string of zeros, which has a very short description, has little information, and a chaotic string, which cannot be compressed, has a lot of information (although that information can be meaningless; we do not try to distinguish between meaningful and meaningless information, so, in our view, any abracadabra has much information unless it has a short description).

If the complexity of a string 𝑥 is equal to 𝑘, we say that 𝑥 has 𝑘 bits of information. One can expect that the amount of information in a string does not exceed its length, that is, 𝐶(𝑥) ⩽ 𝑙(𝑥). This is true (up to an additive constant, as we have already said).

    Theorem 2. There is a constant 𝑐 such that

𝐶(𝑥) ⩽ 𝑙(𝑥) + 𝑐

    for all strings 𝑥.

Proof. Let 𝐷(𝑦) = 𝑦 for all 𝑦. Then 𝐶𝐷(𝑥) = 𝑙(𝑥). By optimality, there exists some 𝑐 such that

𝐶(𝑥) ⩽ 𝐶𝐷(𝑥) + 𝑐 = 𝑙(𝑥) + 𝑐

for all 𝑥. □

Usually this statement is written as follows: 𝐶(𝑥) ⩽ 𝑙(𝑥) + 𝑂(1). Theorem 2 implies, in particular, that Kolmogorov complexity is always finite, that is, every string has a description.

Here is another property of “amount of information” that one can expect: the amount of information does not increase when an algorithmic transformation is performed. (More precisely, the increase is bounded by an additive constant depending on the transformation algorithm.)

    Theorem 3. For every algorithm 𝐴 there exists a constant 𝑐 such that

𝐶(𝐴(𝑥)) ⩽ 𝐶(𝑥) + 𝑐

    for all 𝑥 such that 𝐴(𝑥) is defined.

Proof. Let 𝐷 be an optimal decompressor that is used in the definition of Kolmogorov complexity. Consider another decompressor 𝐷′:

    𝐷′(𝑝) = 𝐴(𝐷(𝑝)).

(We apply first 𝐷 and then 𝐴.) If 𝑝 is a description of a string 𝑥 with respect to 𝐷 and 𝐴(𝑥) is defined, then 𝑝 is a description of 𝐴(𝑥) with respect to 𝐷′. Let 𝑝 be a shortest description of 𝑥 with respect to 𝐷. Then we have

𝐶𝐷′(𝐴(𝑥)) ⩽ 𝑙(𝑝) = 𝐶𝐷(𝑥) = 𝐶(𝑥).

    By optimality we obtain

𝐶(𝐴(𝑥)) ⩽ 𝐶𝐷′(𝐴(𝑥)) + 𝑐 ⩽ 𝐶(𝑥) + 𝑐

for some 𝑐 and all 𝑥. □


This theorem implies that the amount of information “does not depend on the specific encoding”. For instance, if we reverse all bits of some string (replace 0 by 1 and vice versa), or add a zero bit after each bit of that string, the resulting string has the same Kolmogorov complexity as the original one (up to an additive constant). Indeed, the transformation itself and its inverse can be performed by an algorithm.

Here is one more example of a natural property of Kolmogorov complexity. Let 𝑥 and 𝑦 be strings. How much information does their concatenation 𝑥𝑦 have? We expect that the quantity of information in 𝑥𝑦 does not exceed the sum of those in 𝑥 and 𝑦. This is indeed true; however, a small additive term is needed.

    Theorem 4. There is a constant 𝑐 such that for all 𝑥 and 𝑦

𝐶(𝑥𝑦) ⩽ 𝐶(𝑥) + 2 log₂ 𝐶(𝑥) + 𝐶(𝑦) + 𝑐.

Proof. Let us try first to prove the statement in a stronger form, without the term 2 log₂ 𝐶(𝑥). Let 𝐷 be the optimal description mode that is used in the definition of Kolmogorov complexity. Define the following description mode 𝐷′. If 𝐷(𝑝) = 𝑥 and 𝐷(𝑞) = 𝑦 we consider 𝑝𝑞 as a description of 𝑥𝑦, that is, we let 𝐷′(𝑝𝑞) = 𝑥𝑦. Then the complexity of 𝑥𝑦 with respect to 𝐷′ does not exceed the length of 𝑝𝑞, that is, 𝑙(𝑝) + 𝑙(𝑞). If 𝑝 and 𝑞 are minimal descriptions, we obtain 𝐶𝐷′(𝑥𝑦) ⩽ 𝐶𝐷(𝑥) + 𝐶𝐷(𝑦). By optimality the same inequality holds for 𝐷 in place of 𝐷′, up to an additive constant.

What is wrong with this argument? The problem is that 𝐷′ is not well defined. We let 𝐷′(𝑝𝑞) = 𝐷(𝑝)𝐷(𝑞). However, 𝐷′ has no means to separate 𝑝 from 𝑞. It may happen that there are two ways to split the input into 𝑝 and 𝑞 yielding different results:

𝑝1𝑞1 = 𝑝2𝑞2 but 𝐷(𝑝1)𝐷(𝑞1) ≠ 𝐷(𝑝2)𝐷(𝑞2).

There are two ways to fix this bug. The first one, which we use now, goes as follows. Let us prepend the string 𝑝𝑞 with the length 𝑙(𝑝) of the string 𝑝 (in binary notation). This allows us to separate 𝑝 and 𝑞. However, we need to find where 𝑙(𝑝) ends, so let us double all the bits in the binary representation of 𝑙(𝑝) and then put 01 as a separator. More specifically, let bin(𝑘) denote the binary representation of the integer 𝑘; doubling each bit of a string means replacing 0 by 00 and 1 by 11 (for example, bin(5) = 101, and doubling its bits gives 110011). Let

𝐷′( bin(𝑙(𝑝)) 01 𝑝𝑞 ) = 𝐷(𝑝)𝐷(𝑞),

where the prefix bin(𝑙(𝑝)) is written with each bit doubled. Thus 𝐷′ is well defined: the algorithm 𝐷′ scans the doubled bits of bin(𝑙(𝑝)); once it sees 01, it determines 𝑙(𝑝) and then scans 𝑙(𝑝) digits to find 𝑝. The rest of the input is 𝑞, and the algorithm is able to compute 𝐷(𝑝)𝐷(𝑞).

Now we see that 𝐶𝐷′(𝑥𝑦) is at most 2𝑙(bin(𝑙(𝑝))) + 2 + 𝑙(𝑝) + 𝑙(𝑞). The length of the binary representation of 𝑙(𝑝) is at most log₂ 𝑙(𝑝) + 1. Therefore, 𝑥𝑦 has a description of length at most 2 log₂ 𝑙(𝑝) + 4 + 𝑙(𝑝) + 𝑙(𝑞) with respect to 𝐷′, which implies the statement of the theorem. □
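
The encoding used in this proof can be written out directly. Here is a short sketch (ours; hypothetical function names) that produces bin(𝑙(𝑝)) with doubled bits, then 01, then 𝑝𝑞, and splits the result back unambiguously:

```python
def encode_pair(p, q):
    """Encode (p, q) as: bin(l(p)) with every bit doubled, then '01', then p, then q."""
    length_bits = bin(len(p))[2:]                        # binary notation of l(p)
    doubled = "".join(bit + bit for bit in length_bits)  # 0 -> 00, 1 -> 11
    return doubled + "01" + p + q

def decode_pair(code):
    """Recover (p, q): read doubled digits until the separator '01' shows up."""
    i = 0
    length_bits = ""
    while code[i:i + 2] != "01":       # doubled digits look like '00' or '11'
        length_bits += code[i]
        i += 2
    lp = int(length_bits, 2)
    rest = code[i + 2:]
    return rest[:lp], rest[lp:]

p, q = "1101", "000111"
assert decode_pair(encode_pair(p, q)) == (p, q)
print(encode_pair(p, q))   # 110000 01 1101 000111 (without the spaces)
```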

The second way to fix the bug mentioned above goes as follows. We could modify the definition of Kolmogorov complexity by requiring descriptions to be “self-delimiting”; we discuss this approach in detail in Chapter 4.


    Note also that we can exchange 𝑝 and 𝑞 and thus prove that

𝐶(𝑥𝑦) ⩽ 𝐶(𝑥) + 𝐶(𝑦) + 2 log₂ 𝐶(𝑦) + 𝑐.

How tight is the inequality of Theorem 4? Can 𝐶(𝑥𝑦) be much less than 𝐶(𝑥) + 𝐶(𝑦)? According to our intuition, this happens when 𝑥 and 𝑦 have much in common. For example, if 𝑥 = 𝑦, we have 𝐶(𝑥𝑦) = 𝐶(𝑥𝑥) = 𝐶(𝑥) + 𝑂(1), since 𝑥𝑥 can be algorithmically obtained from 𝑥 and vice versa (Theorem 3).

To refine this observation we will define the notion of the quantity of information in 𝑥 that is missing in 𝑦 (for arbitrary strings 𝑥 and 𝑦). This value is called the Kolmogorov complexity of 𝑥 conditional to 𝑦 (or “given 𝑦”) and denoted by 𝐶(𝑥 |𝑦). Its definition is similar to the definition of the unconditional complexity. This time the decompressor 𝐷 has access not only to the (compressed) description, but also to the string 𝑦. We will discuss this notion later in Section 2. Here we mention only that the following equality holds:

𝐶(𝑥𝑦) = 𝐶(𝑦) + 𝐶(𝑥 |𝑦) + 𝑂(log 𝑛)

for all strings 𝑥 and 𝑦 of complexity at most 𝑛. The equality reads as follows: the amount of information in 𝑥𝑦 is equal to the amount of information in 𝑦 plus the amount of new information in 𝑥 (“new” = missing in 𝑦).

The difference 𝐶(𝑥) − 𝐶(𝑥 |𝑦) can be considered as “the quantity of information in 𝑦 about 𝑥”. It indicates how much the knowledge of 𝑦 simplifies 𝑥.

Using the notion of conditional complexity we can ask questions like this: How much new information does the DNA of some organism have compared to the DNA of another organism? If 𝑑1 is the binary string that encodes the first DNA and 𝑑2 is the binary string that encodes the second DNA, then the value in question is 𝐶(𝑑1 |𝑑2). Similarly we can ask what percentage of information has been lost when translating a novel into another language: this percentage is the fraction

    𝐶(original |translation)/𝐶(original).
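
Kolmogorov complexity itself is not computable (see below), but the flavor of such questions can be illustrated with an ordinary compressor used as a crude upper bound: the compressed size stands in for 𝐶(𝑥), and the growth of the compressed size when 𝑥 is appended to 𝑦 stands in for 𝐶(𝑥 |𝑦). This is only a heuristic sketch of ours (using Python's zlib with toy data), not a definition used in the book:

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Crude upper bound on complexity: size of the zlib-compressed data, in bits."""
    return 8 * len(zlib.compress(data, 9))

def conditional_estimate(x: bytes, y: bytes) -> int:
    """Heuristic stand-in for C(x | y): extra bits needed for x once y is known."""
    return max(0, compressed_size(y + x) - compressed_size(y))

original = b"to be or not to be, that is the question " * 20
translation = b"byt ili ne byt, vot v chem vopros " * 20
fraction = conditional_estimate(original, translation) / compressed_size(original)
print(f"estimated fraction of information missing from the translation: {fraction:.2f}")
```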

The questions about information in different objects were studied before the invention of algorithmic information theory. The information was measured using the notion of Shannon entropy. Let us recall its definition. Let 𝜉 be a random variable that takes 𝑛 values with probabilities 𝑝1, . . . , 𝑝𝑛. Then its Shannon entropy 𝐻(𝜉) is defined as follows:

𝐻(𝜉) = ∑ 𝑝𝑖 (− log₂ 𝑝𝑖).

Informally, the outcome having probability 𝑝𝑖 carries log(1/𝑝𝑖) = − log₂ 𝑝𝑖 bits of information (=surprise). Then 𝐻(𝜉) can be understood as the average amount of information in an outcome of the random variable.
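
This formula translates directly into code; a minimal sketch (the function name is ours):

```python
from math import log2

def shannon_entropy(probabilities):
    """H(xi) = sum over i of p_i * (-log2 p_i), ignoring outcomes of probability 0."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(shannon_entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(shannon_entropy([0.9, 0.1]))   # about 0.47 bits: a biased coin
print(shannon_entropy([0.25] * 4))   # 2.0 bits: four equiprobable outcomes
```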

Assume that we want to use Shannon entropy to measure the amount of information contained in some English text. To do this we have to find an ensemble of texts and a probability distribution on this ensemble such that the text is “typical” with respect to this distribution. This makes sense for a short telegram, but for a long text (say, a novel) such an ensemble is hard to imagine.

The same difficulty arises when we try to define the amount of information in the genome of some species. If we consider as the ensemble the set of the genomes of all existing species (or even all species that ever existed), then the cardinality of this set is rather small (it does not exceed 2¹⁰⁰⁰ for sure). And if we consider all its elements as equiprobable, then we obtain a ridiculously small value (less than 1000 bits); for non-uniform distributions the entropy is even less.

So we see that in these contexts Kolmogorov complexity looks like a more adequate tool than Shannon entropy.

    Complexity and randomness

Let us recall the inequality 𝐶(𝑥) ⩽ 𝑙(𝑥) + 𝑂(1) (Theorem 2). For most of the strings its left-hand side is close to the right-hand side. Indeed, the following statement is true:

Theorem 5. Let 𝑛 be an integer. Then there are less than 2ⁿ strings 𝑥 such that 𝐶(𝑥) < 𝑛.

Proof. Let 𝐷 be the optimal description mode used in the definition of Kolmogorov complexity. Then only the strings 𝐷(𝑦) for 𝑦 such that 𝑙(𝑦) < 𝑛 have complexity less than 𝑛. The number of such strings does not exceed the number of strings 𝑦 such that 𝑙(𝑦) < 𝑛, i.e., the sum

1 + 2 + 4 + 8 + . . . + 2ⁿ⁻¹ = 2ⁿ − 1
(there are 2ᵏ strings for each length 𝑘 < 𝑛). □

This implies that the fraction of strings of complexity less than 𝑛 − 𝑐 among all strings of length 𝑛 is less than 2ⁿ⁻ᶜ/2ⁿ = 2⁻ᶜ. For instance, the fraction of strings of complexity less than 90 among all strings of length 100 is less than 2⁻¹⁰.

Thus the majority of strings (of a given length) are incompressible or almost incompressible. In other words, a randomly chosen string of a given length is almost incompressible. This can be illustrated by the following mental (or even real) experiment. Toss a coin, say, 80000 times and get a sequence of 80000 bits. Convert it into a file of size 10000 bytes (8 bits = 1 byte). One can bet that no compression software (existing before the start of the experiment) can compress the resulting file by more than 10 bytes. Indeed, the probability of this event is less than 2⁻⁸⁰ for every fixed compressor, and the number of (existing) compressors is not so large.
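
The experiment is easy to simulate (a sketch of ours, with Python's zlib standing in for "compression software"; with overwhelming probability the random file does not shrink at all, while a regular file shrinks dramatically):

```python
import os, zlib

random_file = os.urandom(10000)            # 80000 coin tosses packed into 10000 bytes
print(len(zlib.compress(random_file, 9)))  # typically a bit MORE than 10000

regular_file = b"\x00" * 10000             # a very regular file, for contrast
print(len(zlib.compress(regular_file, 9))) # a few dozen bytes
```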

It is natural to consider incompressible strings as “random” ones: informally speaking, randomness is the absence of any regularities that may allow us to compress the string. Of course, there is no strict borderline between “random” and “non-random” strings. It is ridiculous to ask which strings of length 3 (i.e., 000, 001, 010, 011, 100, 101, 110, 111) are random and which are not.

Another example: assume that we start with a “random” string of length 10000 and replace its bits by all zeros (one bit at a step). At the end we get a certainly non-random string (zeros only). But it would be naïve to ask at which step the string became non-random for the first time.

Instead, we can naturally define the “randomness deficiency” of a string 𝑥 as the difference 𝑙(𝑥) − 𝐶(𝑥). Using this notion, we can restate Theorem 2 as follows: the randomness deficiency is almost non-negative (i.e., larger than some constant). Theorem 5 says that the randomness deficiency of a string of length 𝑛 is less than 𝑑 with probability at least 1 − 1/2ᵈ (assuming that all strings are equiprobable).

Now consider the Law of Large Numbers; it says that most of the 𝑛-bit strings have frequency of ones close to 1/2. This law can be translated into Kolmogorov complexity language as follows: the frequency of ones in every string with small randomness deficiency is close to 1/2. This translation implies the original statement since most of the strings have small randomness deficiency. We will see later that these formulations are in fact equivalent.

If we insist on drawing a strict borderline between random and non-random objects, we have to consider infinite sequences instead of strings. The notion of randomness for infinite sequences of zeros and ones was defined by Kolmogorov’s student P. Martin-Löf (he came to Moscow from Sweden). We discuss it in Section 3. Later C. Schnorr and L. Levin found a characterization of randomness in terms of complexity: an infinite binary sequence is random if and only if the randomness deficiency of its prefixes is bounded by a constant. This criterion, however, uses another version of Kolmogorov complexity called monotone complexity.

    Non-computability of 𝐶 and Berry’s paradox

Before discussing applications of Kolmogorov complexity, let us mention a fundamental problem that reappears in any application. Unfortunately, the function 𝐶 is not computable: there is no algorithm that, given a string 𝑥, finds its Kolmogorov complexity. Moreover, there is no computable nontrivial (unbounded) lower bound for 𝐶.

Theorem 6. Let 𝑘 be a computable (not necessarily total) function from Ξ to N. (In other words, 𝑘 is an algorithm that terminates on some binary strings and returns natural numbers as results.) If 𝑘 is a lower bound for Kolmogorov complexity, that is, 𝑘(𝑥) ⩽ 𝐶(𝑥) for all 𝑥 such that 𝑘(𝑥) is defined, then 𝑘 is bounded: all its values do not exceed some constant.

The proof of this theorem is a reformulation of the so-called “Berry’s paradox”. This paradox considers

the minimal natural number that cannot be defined by at most fourteen English words.

This phrase has exactly fourteen words and defines that number. Thus we get a contradiction.

Following this idea, consider the first binary string whose Kolmogorov complexity is greater than a given number 𝑁. By definition, its complexity is greater than 𝑁. On the other hand, this string has a short description that includes some fixed amount of information plus the binary notation of 𝑁 (which requires about log₂ 𝑁 bits), and the total number of bits needed is much less than 𝑁 for large 𝑁. That would be a contradiction if we knew how to effectively find this string given its description. Using the computable lower bound 𝑘, we can convert this paradox into a proof.

Proof. Consider the function 𝐵(𝑁) whose argument 𝑁 is a natural number; it is computed by the following algorithm:

perform in parallel the computations 𝑘(Λ), 𝑘(0), 𝑘(1), 𝑘(00), 𝑘(01), 𝑘(10), 𝑘(11), . . . until some string 𝑥 such that 𝑘(𝑥) > 𝑁 appears; then return 𝑥.

If the function 𝑘 is unbounded, then the function 𝐵 is total and 𝑘(𝐵(𝑁)) > 𝑁 by construction for every 𝑁. As 𝑘 is a lower bound for 𝐶, we have 𝐶(𝐵(𝑁)) > 𝑁. On the other hand, 𝐵(𝑁) can be computed given the binary representation bin(𝑁) of 𝑁, therefore

𝐶(𝐵(𝑁)) ⩽ 𝐶(bin(𝑁)) + 𝑂(1) ⩽ 𝑙(bin(𝑁)) + 𝑂(1) ⩽ log₂ 𝑁 + 𝑂(1)

(the first inequality is provided by Theorem 3, the second one is provided by Theorem 2; the term 𝑂(1) stands for a bounded function). So we obtain

𝑁 < 𝐶(𝐵(𝑁)) ⩽ log₂ 𝑁 + 𝑂(1),

which cannot happen if 𝑁 is large enough. □
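
The construction of 𝐵(𝑁) can be sketched as a search over all strings in order. The sketch below is ours and purely illustrative: the computable lower bound 𝑘 must be supplied by the caller (the theorem says no unbounded such 𝑘 exists), and true dovetailing of possibly non-terminating computations 𝑘(Λ), 𝑘(0), 𝑘(1), . . . is not modeled; we simply assume each call either answers or fails quickly.

```python
from itertools import count, product

def all_binary_strings():
    """Enumerate Lambda, 0, 1, 00, 01, 10, 11, ... in order of length."""
    yield ""
    for length in count(1):
        for bits in product("01", repeat=length):
            yield "".join(bits)

def B(N, k, give_up_after=10**6):
    """Return the first string x (in the enumeration order) with k(x) > N."""
    for tries, x in enumerate(all_binary_strings()):
        if tries > give_up_after:
            return None          # k seems to be bounded (or too slow to tell)
        try:
            if k(x) > N:
                return x
        except Exception:
            pass                 # k(x) is undefined on this string

# len is NOT a lower bound for C; it is used here only to show the search mechanics:
print(B(5, k=len))               # '000000', the first string of length 6
```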

    Some applications of Kolmogorov complexity

Let us start with a disclaimer: the applications we will talk about are not real “practical” applications; we just establish relations between Kolmogorov complexity and other important notions.

Occam’s razor. We start with a philosophical question. What do we mean when we say that a theory provides a good explanation for some experimental data? Assume that we are given some experimental data and there are several theories to explain the data. For example, the data might be the observed positions of planets in the sky. We can explain them as Ptolemy did, with epicycles and deferents, introducing extra corrections when needed. On the other hand, we can use the laws of modern mechanics. Why do we think that the modern theory is better? A possible answer: the modern theory can compute the positions of planets with the same (or even better) accuracy using fewer parameters. In other words, Kepler’s achievement is a shorter description of the experimental data.

Roughly speaking, experimenters obtain binary strings and theorists find short descriptions for them (thus proving upper bounds for the complexities of those strings); the shorter the description, the better the theorist.

This approach is sometimes called “Occam’s razor” and is attributed to the philosopher William of Ockham, who said that entities should not be multiplied beyond necessity. It is hard to judge whether he would agree with such an interpretation of his words.

We can use the same idea in more practical contexts. Assume that we design a machine that reads handwritten zip codes on envelopes. We are looking for a rule that separates, say, images of zeros from images of ones. An image is given as a Boolean matrix (or a binary string). We have several thousands of images and for each image we know whether it means 0 or 1. We want to find a reasonable separating rule (with the hope that it can be applied to the forthcoming images). What does “reasonable” mean in this context? If we just list all the images in our list together with their classification, we get a valid separation rule (at least it works until we receive a new image), but the rule is way too long. It is natural to assume that a reasonable rule must have a short description, that is, it must have low Kolmogorov complexity.

Foundations of probability theory. Probability theory itself, being currently a part of measure theory, is mathematically sound and does not need any extra “foundations”. The difficult questions arise, however, if we try to understand why this theory can be applied to real-world phenomena and how it should be applied.

Assume that we toss a coin a thousand times (or test some other hardware random number generator) and get a bit string of length 1000. If this string contains only zeros or equals 0101010101. . . (zeros and ones alternate), then we definitely will conclude that the generator is bad. Why?

The usual explanation: the probability of obtaining a thousand zeros is negligible (2⁻¹⁰⁰⁰) provided the coin is fair. Therefore the conjecture of a fair coin is refuted by the experiment.

The problem with this explanation is that we do not always reject the generator: there should be some sequence 𝛼 of a thousand zeros and ones which is consistent with this conjecture. Note, however, that the probability of obtaining the sequence 𝛼 as a result of fair coin tossing is also 2⁻¹⁰⁰⁰. So what is the reason behind our complaints? What is the difference between the sequence of a thousand zeros and the sequence 𝛼?

The reason is revealed when we compare the Kolmogorov complexities of these sequences.

Proving theorems of probability theory. As an example, consider the Strong Law of Large Numbers. It claims that for almost all (according to the uniform Bernoulli probability distribution) infinite binary sequences, the limit of the frequencies of 1s in their initial segments equals 1/2.

More formally, let Ω be the set of all infinite sequences of zeros and ones. The uniform Bernoulli measure on Ω is defined as follows. For every finite binary string 𝑥 consider the set Ω𝑥 consisting of all infinite sequences that start with 𝑥. For example, ΩΛ = Ω. The measure of Ω𝑥 is equal to 2^(−𝑙(𝑥)). For example, the measure of the set Ω01, which consists of all sequences starting with 01, equals 1/4.

For each sequence 𝜔 = 𝜔0𝜔1𝜔2 . . . consider the limit of the frequencies of 1s in the prefixes of 𝜔, that is,

lim_{𝑛→∞} (𝜔0 + 𝜔1 + . . . + 𝜔𝑛−1)/𝑛.

We say that 𝜔 satisfies the Strong Law of Large Numbers (SLLN) if this limit exists and is equal to 1/2. For instance, the sequence 010101 . . . , having period 2, satisfies the SLLN, and the sequence 011011011 . . . , having period 3, does not.
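
The two periodic examples are easy to check numerically (a small sketch of ours): the frequency of ones in prefixes of 010101. . . tends to 1/2, while for 011011011. . . it tends to 2/3, so the limit exists in both cases but equals 1/2 only in the first.

```python
def frequency_of_ones(pattern, n):
    """Frequency of 1s among the first n terms of the periodic sequence pattern pattern ..."""
    prefix = (pattern * (n // len(pattern) + 1))[:n]
    return prefix.count("1") / n

for n in (10, 100, 1000, 10000):
    print(n, frequency_of_ones("01", n), frequency_of_ones("011", n))
# the frequencies in the first column tend to 0.5, those in the second to 2/3
```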

The Strong Law of Large Numbers says that the set of sequences that do not satisfy the SLLN has measure 0. Recall that a set 𝐴 ⊂ Ω has measure 0 if for all 𝜀 > 0 there is a sequence of strings 𝑥0, 𝑥1, 𝑥2, . . . such that

𝐴 ⊂ Ω𝑥0 ∪ Ω𝑥1 ∪ Ω𝑥2 ∪ . . .
and the sum of the series
2^(−𝑙(𝑥0)) + 2^(−𝑙(𝑥1)) + 2^(−𝑙(𝑥2)) + . . .

(the sum of the measures of Ω𝑥𝑖) is less than 𝜀.
One can prove the SLLN using the notion of a Martin-Löf random sequence mentioned above. The proof consists of two parts. First, we show that every Martin-Löf random sequence satisfies the SLLN. This can be done using the Levin–Schnorr randomness criterion (if the limit does not exist or differs from 1/2, then the complexity of some prefix is less than it should be for a random sequence).

The second part is rather general and does not depend on the specific law of probability theory. We prove that the set of all Martin-Löf non-random sequences has measure zero. This implies that the set of sequences that do not satisfy the SLLN is included in a set of measure 0 and hence has measure 0 itself.

The notion of a random sequence is philosophically interesting in its own right. In the beginning of the XXth century Richard von Mises suggested using this notion (he called it in German “Kollektiv”) as a basis for probability theory (at that time the measure theory approach was not developed yet). He considered the so-called “frequency stability” as the main property of random sequences. We will consider von Mises’ approach to the definition of a random sequence (and the subsequent developments) in Chapter 9.

Lower bounds for computational complexity. Kolmogorov complexity turned out to be a useful technical tool when proving lower bounds for computational complexity. Let us explain the idea using the following model example.

Consider the following problem: Initially a string 𝑥 of length 𝑛 is located in the 𝑛 leftmost cells of the tape of a Turing machine. The machine has to copy 𝑥, that is, to get 𝑥𝑥 on the tape (the string 𝑥 is intact and its copy is appended) and halt.

Since the middle of the 1960s it has been well known that a (one-tape) Turing machine needs time proportional to 𝑛² to perform this task. More specifically, one can show that for every Turing machine 𝑀 that copies every string 𝑥 there exists some 𝜀 > 0 such that for all 𝑛 there is a string 𝑥 of length 𝑛 whose copying requires at least 𝜀𝑛² steps.

Consider the following intuitive argument supporting this claim. The number of internal states of a Turing machine is a constant (depending on the machine). That is, the machine can keep in its memory only a finite number of bits. The speed of the head movement is also limited: one cell per step. Hence the rate of information transfer (measured in bit·cell/step) is bounded by a constant depending on the number of internal states. To copy a string 𝑥 of length 𝑛, we need to move 𝑛 bits by 𝑛 cells to the right, therefore the number of steps should be proportional to 𝑛² (or more).

Using Kolmogorov complexity, we can make this argument rigorous. A string is hard to copy if it contains a maximal amount of information, i.e., if its complexity is close to 𝑛. We consider this example in detail in Section 8.2 (p. 245).

A combinatorial interpretation of Kolmogorov complexity. We consider here one example of this kind (see Chapter 10, p. 323, for more details). One can prove the following inequality for the complexities of three strings and their combinations:

2𝐶(𝑥𝑦𝑧) ⩽ 𝐶(𝑥𝑦) + 𝐶(𝑥𝑧) + 𝐶(𝑦𝑧) + 𝑂(log 𝑛)

for all strings 𝑥, 𝑦, 𝑧 of length at most 𝑛.
It turns out that this inequality has natural interpretations that are not related to complexity at all. In particular, it implies (see [65]) the following geometrical fact:

Consider a body 𝐵 in three-dimensional Euclidean space with coordinate axes 𝑂𝑋, 𝑂𝑌 and 𝑂𝑍. Let 𝑉 be 𝐵’s volume. Consider 𝐵’s orthogonal projections onto the coordinate planes 𝑂𝑋𝑌, 𝑂𝑋𝑍 and 𝑂𝑌𝑍. Let 𝑆𝑥𝑦, 𝑆𝑥𝑧 and 𝑆𝑦𝑧 be the areas of these projections. Then

𝑉² ⩽ 𝑆𝑥𝑦 · 𝑆𝑥𝑧 · 𝑆𝑦𝑧.
Here is an algebraic corollary of the same inequality. For every group 𝐺 and its subgroups 𝑋, 𝑌 and 𝑍 we have

|𝑋 ∩ 𝑌 ∩ 𝑍|² ⩾ |𝑋 ∩ 𝑌| · |𝑋 ∩ 𝑍| · |𝑌 ∩ 𝑍| / |𝐺|,

    where |𝐻| denotes the number of elements in 𝐻.
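
A discrete sanity check of the geometric inequality above (a sketch of ours): take a finite set of unit cells, let the "volume" be the number of cells and the "areas" be the numbers of cells in the three coordinate projections.

```python
from itertools import product

# An arbitrary finite "body": a 5 x 4 x 3 box of unit cells with one cell removed.
body = set(product(range(5), range(4), range(3))) - {(0, 0, 0)}

V = len(body)                                   # "volume": number of cells
S_xy = len({(x, y) for x, y, _ in body})        # cells in the projection onto OXY
S_xz = len({(x, z) for x, _, z in body})        # ... onto OXZ
S_yz = len({(y, z) for _, y, z in body})        # ... onto OYZ

print(V * V, S_xy * S_xz * S_yz)                # 3481 and 3600
assert V * V <= S_xy * S_xz * S_yz
```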


Gödel incompleteness theorem. Following G. Chaitin, let us explain how to use Theorem 6 in order to prove the famous Gödel incompleteness theorem. This theorem states that not all true statements of a formal theory that is “rich enough” (formal arithmetic and axiomatic set theory are two examples of such a theory) are provable in the theory.

Assume that for every string 𝑥 and every natural number 𝑛, one can express the statement 𝐶(𝑥) > 𝑛 as a formula in the language of our theory. (This statement says that the chosen optimal decompressor 𝐷 does not output 𝑥 on any input of length at most 𝑛; one can easily write this statement in formal arithmetic and therefore in set theory.)

Let us generate all the proofs (derivations) in our theory and select those that prove some statement of the form 𝐶(𝑥) > 𝑛, where 𝑥 is some string and 𝑛 is some integer (statements of this type have no free variables). Once we have found a new theorem of this type, we compare 𝑛 with all previously found 𝑛’s. If the new 𝑛 is greater than all previous 𝑛’s, we write the new 𝑛 into the “records table” together with the corresponding 𝑥𝑛.

There are two possibilities: either (1) the table will grow infinitely, or (2) there is a last statement 𝐶(𝑋) > 𝑁 in the table which remains unbeaten forever. If (2) happens, there is an entire class of true statements that have no proof. Namely, all true statements of the form 𝐶(𝑥) > 𝑛 with 𝑛 > 𝑁 have no proofs. (Recall that by Theorem 5 there are infinitely many such statements.)

In the first case we have an infinite computable sequence of strings 𝑥0, 𝑥1, 𝑥2, . . . and numbers 𝑛0 < 𝑛1 < 𝑛2 < . . . such that all statements 𝐶(𝑥𝑖) > 𝑛𝑖 are provable. We assume that the theory proves only true statements, thus all the inequalities 𝐶(𝑥𝑖) > 𝑛𝑖 are true. Without loss of generality we can assume that all 𝑥𝑖 are pairwise different (we can omit 𝑥𝑖 if there exists 𝑗 < 𝑖 such that 𝑥𝑗 = 𝑥𝑖; every string can occur only finitely many times in the sequence 𝑥0, 𝑥1, 𝑥2, . . . since 𝑛𝑖 → ∞ as 𝑖 → ∞). The computable function 𝑘, defined by the equation 𝑘(𝑥𝑖) = 𝑛𝑖, is then an unbounded lower bound for Kolmogorov complexity. This contradicts Theorem 6.

Basic notions and notations

This section is intended for people who are already familiar with some notions of Kolmogorov complexity and the theory of algorithmic randomness and want to take a quick look at the terminology and notation used throughout this book. Other readers can (and probably should) skip it and look back only when needed.

The set of all integers is denoted by Z, the notation N refers to the set of all non-negative integers (i.e., natural numbers), and R stands for the set of all reals. The set of all rational numbers is denoted by Q. Dyadic rationals are those rationals having the form 𝑚/2ⁿ for some integers 𝑚 and 𝑛.

The cardinality of a set 𝐴 is denoted by |𝐴|.
When the base of the logarithm is omitted, it is assumed to be 2; thus log 𝑥 means the same as log₂ 𝑥 (as usual, ln 𝑥 denotes the natural logarithm).

We use the notation ⌊𝑥⌋ for the integer part of a real number 𝑥 (the largest integer that is less than or equal to 𝑥). Similarly, ⌈𝑥⌉ denotes the smallest integer that is larger than or equal to 𝑥.

Orders of magnitude. The notation 𝑓 ⩽ 𝑔 + 𝑂(1), where 𝑓 and 𝑔 are expressions containing variables, means that for some 𝑐 the inequality 𝑓 ⩽ 𝑔 + 𝑐 holds for all values of the variables. In a similar way we understand the expression 𝑓 ⩽ 𝑔 + 𝑂(ℎ) (where ℎ is non-negative): it means that for some 𝑐, for all values of the variables, the inequality 𝑓 ⩽ 𝑔 + 𝑐ℎ holds. The notation 𝑓 = 𝑔 + 𝑂(ℎ) (where ℎ is non-negative) means that for some 𝑐, for all values of the variables, we have |𝑓 − 𝑔| ⩽ 𝑐ℎ. In particular, 𝑓 = 𝑂(ℎ) holds if |𝑓| ⩽ 𝑐ℎ for some constant 𝑐; the notation 𝑓 = Ω(ℎ) means that |𝑓| ⩾ 𝑐ℎ for some constant 𝑐 > 0 (usually 𝑓 is positive). The notation 𝑓 = Θ(ℎ) means that 𝑐1ℎ ⩽ |𝑓| ⩽ 𝑐2ℎ (again, usually 𝑓 is positive).

B denotes the set {0, 1}. Finite sequences of 0s and 1s are called binary strings. The set of all binary strings is denoted by Ξ. If 𝐴 is a finite set (an alphabet), then 𝐴^𝑛 denotes the set of all strings of length 𝑛 over the alphabet 𝐴, that is, the set of all sequences of length 𝑛 whose terms belong to 𝐴. We denote by 𝐴* the set of all strings over the alphabet 𝐴 (including the empty string Λ of length 0). For instance, Ξ = B*. The length of a string 𝑥 is denoted by 𝑙(𝑥). The notation 𝑎𝑏 refers to the concatenation of strings 𝑎 and 𝑏, that is, the result of appending 𝑏 to 𝑎. We say that a string 𝑎 is a prefix of a string 𝑏 if 𝑏 = 𝑎𝑥 for some string 𝑥. We say that 𝑎 is a suffix of a string 𝑏 if 𝑏 = 𝑥𝑎 for some string 𝑥. We say that 𝑎 is a substring of 𝑏 if 𝑏 = 𝑥𝑎𝑦 for some strings 𝑥 and 𝑦 (in other words, 𝑎 is a suffix of a prefix of 𝑏, or the other way around).

We consider also infinite sequences of zeros and ones, and Ω denotes the set of all such sequences. The set of infinite sequences of elements of a set 𝐴 is denoted by 𝐴∞; thus Ω = B∞. For a finite sequence 𝑥 we use the notation Ω𝑥 for the set of all infinite sequences that start with 𝑥 (i.e., have 𝑥 as a prefix); sets of this form are called intervals. The concatenation 𝑥𝜔 of a finite sequence 𝑥 and an infinite sequence 𝜔 is defined in a natural way.

In some contexts it is convenient to consider finite and infinite sequences together. We use the notation Σ for the set of all finite and infinite sequences of zeros and ones, i.e., Σ = Ξ ∪ Ω, and Σ𝑥 denotes the set of all finite and infinite extensions of a string 𝑥.

We consider computable functions whose arguments and values are binary strings. Unless stated otherwise, functions are partial (not necessarily total). A function 𝑓 is called computable if there is a machine (a program, an algorithm) that halts on every input 𝑥 such that 𝑓(𝑥) is defined, outputs the result 𝑓(𝑥), and does not halt on any input 𝑥 outside the domain of 𝑓. We also consider computable functions whose arguments and values are finite objects of other types, like natural numbers, integers, finite graphs, etc. We assume that finite objects are encoded by binary strings. The choice of an encoding is not important provided different encodings can be translated into each other. The latter means that we can algorithmically decide whether a string is an encoding of an object and, if this is the case, find an encoding of the same object with respect to the other encoding.

Sometimes we consider computable functions of infinite objects, like real numbers or measures. Such considerations require rigorous definitions of the notion of computability, which are provided when needed (see below).

A set of finite objects (binary strings, natural numbers, etc.) is called computably enumerable, or just enumerable, if there is a machine (a program, an algorithm) without input that prints all elements of the set (and no other elements) with arbitrary delays between printing consecutive elements. The algorithm is not required to halt even when the set is finite. The order in which the elements are printed can be arbitrary.

A real number 𝛼 is computable if there exists an algorithm that computes 𝛼 with any given precision: for any given rational 𝜀 > 0 the algorithm must produce a rational number at distance at most 𝜀 from 𝛼 (in this case we say that the algorithm computes the number). A real number 𝛼 is lower semicomputable if it can be represented as the limit of a non-decreasing computable sequence of rational numbers. An equivalent definition: 𝛼 is lower semicomputable if the set of rational numbers that are less than 𝛼 is enumerable. A sequence of real numbers is computable if all its terms are computable and, given any 𝑛, we are able to find an algorithm computing the 𝑛th number in the sequence. The notion of a lower semicomputable sequence of reals is defined in an entirely similar way (for any given 𝑛 we have to find an algorithm that lower semicomputes the 𝑛th number).
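For illustration, here is a minimal Python sketch (ours, not from the book) of the first definition: a routine that, given a rational 𝜀 > 0, returns a rational within 𝜀 of √2, thereby witnessing that √2 is a computable real.

    from fractions import Fraction

    def sqrt2_approx(eps: Fraction) -> Fraction:
        # Returns a rational q with |q - sqrt(2)| <= eps, by binary search
        # with exact rational arithmetic; sqrt(2) lies in [1, 2].
        lo, hi = Fraction(1), Fraction(2)
        while hi - lo > eps:
            mid = (lo + hi) / 2
            if mid * mid <= 2:
                lo = mid
            else:
                hi = mid
        return lo

For example, sqrt2_approx(Fraction(1, 1000)) returns a rational within 1/1000 of √2. Since the returned approximations never exceed √2 and do not decrease as 𝜀 shrinks, the same routine also exhibits √2 as the limit of a non-decreasing computable sequence of rationals, i.e., as lower semicomputable.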

We consider measures (more specifically, probability measures, or probability distributions) on Ω. Every measure can be identified by its values on the intervals Ω𝑥. So measures are identified with non-negative functions 𝑝 on strings which satisfy the following two conditions: 𝑝(Λ) = 1 and 𝑝(𝑥) = 𝑝(𝑥0) + 𝑝(𝑥1) for all 𝑥. Such measures are called measures on the binary tree. We consider also semimeasures on the binary tree, which are probability measures on the space Σ of all finite and infinite binary sequences. They correspond to functions 𝑝 such that 𝑝(Λ) = 1 and 𝑝(𝑥) ⩾ 𝑝(𝑥0) + 𝑝(𝑥1). We consider also semimeasures on natural numbers, which are defined as sequences {𝑝𝑖} of non-negative reals with ∑𝑖∈N 𝑝𝑖 ⩽ 1. It is natural to identify such sequences with probability distributions on the set N⊥, which consists of the natural numbers and of the special symbol ⊥ (“undefined value”).
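As a small illustration (ours), the two conditions defining a semimeasure on the binary tree can be checked on a finite table of values:

    from fractions import Fraction

    def is_semimeasure_on_tree(p):
        # p: dict mapping binary strings to non-negative rationals, listing the
        # values on all strings up to some length (missing children count as 0).
        # Checks p(Λ) = 1 and p(x) >= p(x0) + p(x1) for every x whose children
        # are listed; a measure on the tree satisfies these with equality.
        if p.get("", Fraction(0)) != 1:
            return False
        for x in p:
            if x + "0" in p or x + "1" in p:
                if p[x] < p.get(x + "0", Fraction(0)) + p.get(x + "1", Fraction(0)):
                    return False
        return True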

Among all semimeasures (on the tree or on natural numbers) we distinguish lower semicomputable ones. Both the class of lower semicomputable semimeasures on the tree and the class of lower semicomputable semimeasures on natural numbers have a maximal semimeasure (up to a multiplicative constant). Any maximal lower semicomputable semimeasure is called an a priori probability (on the tree or on natural numbers). The a priori probability of a natural number 𝑛 is denoted by 𝑚(𝑛); the a priori probability of a node 𝑥 in the binary tree (that is, of the string 𝑥) is denoted by 𝑎(𝑥). We also use the notation 𝑚(𝑥) for a binary string 𝑥, which means the a priori probability of the number corresponding to 𝑥 with respect to some fixed computable one-to-one correspondence between strings and natural numbers.

The plain Kolmogorov complexity is denoted by 𝐶(𝑥), and the prefix Kolmogorov complexity is denoted by 𝐾(𝑥) (and by 𝐾′(𝑥) when we want to stress that we are using prefix-free description modes). The same letters are used to denote complexities of pairs, triples, etc., and to denote conditional complexity. For instance, 𝐶(𝑥 | 𝑦) stands for the plain conditional complexity of 𝑥 when 𝑦 is known, and 𝑚(𝑥, 𝑦 | 𝑧) denotes the a priori probability of the pair (𝑥, 𝑦) (that is, of the corresponding number) when 𝑧 is known. The monotone Kolmogorov complexity is denoted by KM, and the a priori complexity (the negative logarithm of the a priori probability on the tree) is denoted by KA. (In the literature monotone complexity is sometimes denoted by Km or 𝐾𝑚, and a priori complexity by KM.) Finally, the decision complexity is denoted by KR.

BB(𝑛) denotes the maximal halting time of the optimal decompressor on inputs of length at most 𝑛 (if the optimal prefix decompressor is meant, we use the notation BP(𝑛)). The function BB(𝑛) is closely related to the function 𝐵(𝑛) defined as the maximal natural number of Kolmogorov complexity at most 𝑛.

We use also several topological notions. The space N⊥ consists of the natural numbers and of a special element ⊥ (“undefined value”); the family of open sets consists of the whole space and of all sets that do not contain ⊥. This topological space, as well as the space Σ (where the family of open sets consists of all unions of sets of the form Σ𝑥), is used for the general classification of complexities. For the spaces Ω and Σ and for the space of real numbers, we call a set effectively open if it is the union of a computably enumerable family of intervals (sets of the form Σ𝑥 for the second space, and intervals with rational endpoints for the space of reals).

Most notions of computability theory (including Kolmogorov complexity) can be relativized, which means that all the algorithms involved are allowed to use an external procedure, called an oracle. That procedure can be asked whether any given number belongs to a set 𝐴; that set is also called an oracle. Thus we get the notions of “decidability relative to an oracle 𝐴”, “computability relative to 𝐴”, etc. In the corresponding notations we use the superscript 𝐴, for example, 𝐶^𝐴(𝑥).

In the chapter on classical information theory, we use the notion of the Shannon entropy of a random variable 𝜉. If the variable has 𝑘 possible outcomes and 𝑝1, . . . , 𝑝𝑘 are their probabilities, then its Shannon entropy 𝐻(𝜉) is defined as −∑𝑖 𝑝𝑖 log 𝑝𝑖. This definition makes sense also for pairs of jointly distributed random variables. For such a pair, the conditional entropy of a random variable 𝜉 when 𝜂 is known is defined as 𝐻(𝜉, 𝜂) − 𝐻(𝜂). The difference 𝐻(𝜉) + 𝐻(𝜂) − 𝐻(𝜉, 𝜂) is called the mutual information of the random variables 𝜉 and 𝜂 and is denoted by 𝐼(𝜉 : 𝜂). A similar notation 𝐼(𝑥 : 𝑦) is used in algorithmic information theory. As 𝐼(𝑥 : 𝑦) is symmetric only up to a small error term, we usually say “the information in 𝑥 about 𝑦” and define this notion as 𝐶(𝑦) − 𝐶(𝑦 | 𝑥).

CHAPTER 1

Plain Kolmogorov complexity

1.1. The definition and main properties

Let us recall the definition of Kolmogorov complexity from the Introduction. This version of complexity was defined by Kolmogorov in his seminal paper [77]. In order to distinguish it from later versions, we call it the plain Kolmogorov complexity. Later, starting from Chapter 4, we will also consider other versions of Kolmogorov complexity, including the prefix one and the monotone one, but for now by Kolmogorov complexity we always mean the plain one.

Recall that a description mode, or a decompressor, is a partial computable function 𝐷 from the set of all binary strings Ξ into Ξ. A partial function 𝐷 is computable if there is an algorithm that terminates and returns 𝐷(𝑥) on every input 𝑥 in the domain of 𝐷 and does not terminate on all other inputs. We say that 𝑦 is a description of 𝑥 with respect to 𝐷 if 𝐷(𝑦) = 𝑥.

The complexity of a string 𝑥 with respect to a description mode 𝐷 is defined as

𝐶𝐷(𝑥) = min{𝑙(𝑦) | 𝐷(𝑦) = 𝑥}.

(The minimum of the empty set is +∞.)

We say that a description mode 𝐷1 is not worse than a description mode 𝐷2 if there is a constant 𝑐 such that 𝐶𝐷1(𝑥) ⩽ 𝐶𝐷2(𝑥) + 𝑐 for all 𝑥; we write this as 𝐶𝐷1(𝑥) ⩽ 𝐶𝐷2(𝑥) + 𝑂(1).

A description mode is called optimal if it is not worse than any other description mode. By the Kolmogorov–Solomonoff universality theorem (Theorem 1, p. 13) optimal description modes exist. Let us briefly recall its proof. Let 𝑈 be an interpreter of a universal programming language, that is, 𝑈(𝑝, 𝑥) is the output of the program 𝑝 on input 𝑥. We assume that programs and inputs are binary strings. Let

𝐷(𝑝̂𝑥) = 𝑈(𝑝, 𝑥).

Here 𝑝 ↦→ 𝑝̂ stands for any computable mapping with the following property: given 𝑝̂ we can effectively find 𝑝 and also the place where 𝑝̂ ends (in particular, if 𝑝̂ is a prefix of 𝑞̂, then 𝑝 = 𝑞). This property implies that 𝐷 is well defined. For any description mode 𝐷′ let 𝑝 be a program for 𝐷′. Then

𝐶𝐷(𝑥) ⩽ 𝐶𝐷′(𝑥) + 𝑙(𝑝̂).

Indeed, for every description 𝑦 of 𝑥 with respect to 𝐷′ the string 𝑝̂𝑦 is a description of 𝑥 with respect to 𝐷.
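As an illustration of a mapping 𝑝 ↦→ 𝑝̂, here is a Python sketch (ours; one possible choice, not the book's fixed one): double every bit of 𝑝 and append 01, which lets the decoder find where 𝑝̂ ends. The name U below stands for an assumed interpreter of a universal programming language.

    def encode(p: str) -> str:
        # one self-delimiting encoding of p: double each bit, then append "01"
        return "".join(b + b for b in p) + "01"

    def decode(s: str):
        # recover p and the position where its encoding ends inside s
        # (assumes s starts with a valid encoding: pairs "00"/"11", then "01")
        i, bits = 0, []
        while s[i:i + 2] != "01":
            bits.append(s[i])          # each pair "00" or "11" contributes one bit
            i += 2
        return "".join(bits), i + 2

    def D(s: str, U):
        # the universal description mode: D(encode(p) + x) = U(p, x)
        p, k = decode(s)
        return U(p, s[k:])

With this choice, 𝑙(𝑝̂) = 2𝑙(𝑝) + 2, which is enough for the argument: the constant in 𝐶𝐷(𝑥) ⩽ 𝐶𝐷′(𝑥) + 𝑙(𝑝̂) depends only on 𝐷′, not on 𝑥.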

Fix any optimal description mode 𝐷 and let 𝐶(𝑥) (we drop the subscript) denote the complexity of 𝑥 with respect to 𝐷. (As we mentioned, in the first paper of Kolmogorov [77] the letter 𝐾 was used, while in his second paper [78] the letter 𝐻 was used. We follow here the notation used by Li and Vitányi [102].)


As the optimal description mode is not worse than the identity function 𝑥 ↦→ 𝑥, we obtain the inequality 𝐶(𝑥) ⩽ 𝑙(𝑥) + 𝑂(1) (Theorem 2, p. 15).

Let 𝐴 be a partial computable function. Comparing the optimal description mode 𝐷 with the description mode 𝑦 ↦→ 𝐴(𝐷(𝑦)), we conclude that

𝐶(𝐴(𝑥)) ⩽ 𝐶(𝑥) + 𝑂(1),

showing the non-growth of complexity under algorithmic transformations (Theorem 3, p. 15).

Using this inequality, we can define Kolmogorov complexity of other “finite objects”, like natural numbers, graphs, permutations, finite sets of strings, etc., that can be naturally encoded by binary strings.

For example, let us define the complexity of natural numbers. A natural number 𝑛 can be written in binary notation. Another way to represent a number by a string is as follows. Enumerate all the binary strings in the order

Λ, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, . . .

(first by length, then lexicographically within each length), using the natural numbers 0, 1, 2, 3, . . . as indices. This enumeration is more convenient than the binary representation, as it is a bijection. Every string can be considered as an encoding of its index in this enumeration. Finally, one can also encode a natural number 𝑛 by a string consisting of 𝑛 ones.

Using any of these three encodings we can define the complexity of 𝑛 as the complexity of the string encoding 𝑛. The three resulting complexities of 𝑛 differ at most by an additive constant. Indeed, for every pair of these encodings there is an algorithm translating the first encoding into the second one; applying this algorithm, we increase the complexity at most by a constant. Note that the Kolmogorov complexity of binary strings is anyway defined only up to an additive constant, so the choice of a computable encoding does not matter.
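Here is a Python sketch (ours) of the three encodings and of a translation that underlies the constant-difference argument; the helper names are ours.

    def to_binary(n: int) -> str:
        # binary notation of n ("0" for n = 0)
        return bin(n)[2:]

    def to_index_string(n: int) -> str:
        # the n-th string in the list Λ, 0, 1, 00, 01, 10, 11, ...:
        # write n + 1 in binary and drop the leading 1 (a bijection N <-> Ξ)
        return bin(n + 1)[3:]

    def to_unary(n: int) -> str:
        # n encoded by a string of n ones
        return "1" * n

    def index_of(s: str) -> int:
        # inverse of to_index_string: prepend 1, read as binary, subtract 1
        return int("1" + s, 2) - 1

Each translation between two of these encodings is computable (for example, composing index_of with to_binary translates the second encoding into the first), so by the non-growth property the corresponding complexities of 𝑛 differ by at most a constant.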

As the length of the binary representation of a natural number 𝑛 is log 𝑛 + 𝑂(1), the Kolmogorov complexity of 𝑛 is at most log 𝑛 + 𝑂(1). (By log we denote the binary logarithm.)

Here is another application of the non-growth of complexity under algorithmic transformations. Let us show that deleting the last bit of a string changes its complexity at most by a constant. Indeed, all three functions 𝑥 ↦→ 𝑥0, 𝑥 ↦→ 𝑥1, and 𝑥 ↦→ (𝑥 without the last bit) are computable.

The same is true for the first bit. However, this does not apply to every bit of the string. To show this, consider the string 𝑥 consisting of 2^𝑛 zeros; its complexity is at most 𝐶(𝑛) + 𝑂(1) ⩽ log 𝑛 + 𝑂(1). There are 2^𝑛 different strings obtained from 𝑥 by flipping one bit. At least one of them has complexity 𝑛 or more. (Recall that the number of strings of complexity less than 𝑛 does not exceed the number of descriptions of length less than 𝑛, which is less than 2^𝑛; see Theorem 5, p. 18.)

Incrementing a natural number 𝑛 by 1 changes 𝐶(𝑛) at most by a constant. This implies that 𝐶(𝑛) satisfies a “Lipschitz property”: for some 𝑐 and for all 𝑚, 𝑛 we have |𝐶(𝑚) − 𝐶(𝑛)| ⩽ 𝑐|𝑚 − 𝑛|.

1 Prove a stronger inequality: |𝐶(𝑚) − 𝐶(𝑛)| ⩽ |𝑚 − 𝑛| + 𝑐 for some 𝑐 and for all 𝑚, 𝑛 ∈ N, and, moreover, |𝐶(𝑚) − 𝐶(𝑛)| ⩽ 2 log |𝑚 − 𝑛| + 𝑐 (the latter inequality assumes that 𝑚 ≠ 𝑛).


We have used several times the upper bound 2^𝑛 for the number of strings 𝑥 with 𝐶(𝑥) < 𝑛. Note that, in contrast to other bounds, it involves no constants. Nevertheless this bound has a hidden dependence on the choice of the optimal description mode: if we switch to another optimal description mode, the set of strings 𝑥 such that 𝐶(𝑥) < 𝑛 can change!

2 Show that the number of strings of complexity less than 𝑛 is in the range [2^(𝑛−𝑐); 2^𝑛] for some constant 𝑐 and for all 𝑛. [Hint: the upper bound 2^𝑛 is proved in the Introduction; the lower bound follows from the inequality 𝐶(𝑥) ⩽ 𝑙(𝑥) + 𝑐: the complexity of every string of length less than 𝑛 − 𝑐 is less than 𝑛.]

Show that the number of strings of complexity exactly 𝑛 does not exceed 2^𝑛 but can be much less: e.g., it is possible that this set is empty for infinitely many 𝑛. [Hint: change an optimal description mode by appending 0 or 11 to each description, so that all descriptions have even length.]

3 Prove that the average complexity of strings of length 𝑛 is equal to 𝑛 + 𝑂(1). [Hint: let 𝛼𝑘 denote the fraction of strings of complexity 𝑛 − 𝑘 among the strings of length 𝑛. Then the average complexity is less than 𝑛 by ∑𝑘 𝑘𝛼𝑘. Use the inequality 𝛼𝑘 ⩽ 2^(−𝑘) and the convergence of the series ∑𝑘 𝑘/2^𝑘.]

In the next statement we establish a formal relation between upper bounds on complexity and upper bounds on cardinality.

Theorem 7. (a) The family of sets 𝑆𝑛 = {𝑥 | 𝐶(𝑥) < 𝑛} is enumerable, and |𝑆𝑛| < 2^𝑛 for all 𝑛. Here |𝑆𝑛| denotes the cardinality of 𝑆𝑛.

(b) If 𝑉𝑛 (𝑛 = 0, 1, . . . ) is an enumerable family of sets of strings and |𝑉𝑛| < 2^𝑛 for all 𝑛, then there exists 𝑐 such that 𝐶(𝑥) < 𝑛 + 𝑐 for all 𝑛 and all 𝑥 ∈ 𝑉𝑛.

In this theorem we use the notion of an enumerable family of sets. It is defined as follows. A set of strings (or natural numbers, or other finite objects) is enumerable (= computably enumerable = recursively enumerable) if there is an algorithm generating all elements of this set in some order. This means that there is a program that never terminates and prints all the elements of the set in some order. The intervals between printing elements can be arbitrarily large; if the set is finite, the program can print nothing after some time (unknown to the observer). Repetitions are allowed, but this does not matter since we can filter the output and delete the elements that have already been printed.

For example, the set of all 𝑛 such that the decimal expansion of √2 has exactly 𝑛 consecutive nines is enumerable. The following algorithm generates this set: compute the decimal digits of √2 starting with the most significant ones; once a sequence of exactly 𝑛 consecutive nines surrounded by non-nines is found, print 𝑛 and continue. (A sketch of such an enumerator appears after the next paragraph.)

A family of sets 𝑉𝑛 is called enumerable if the set of pairs {⟨𝑛, 𝑥⟩ | 𝑥 ∈ 𝑉𝑛} is enumerable. This implies that each of the sets 𝑉𝑛 is enumerable: to generate the elements of 𝑉𝑛 for a fixed 𝑛, we run the algorithm enumerating the set {⟨𝑛, 𝑥⟩ | 𝑥 ∈ 𝑉𝑛} and print the second components of all pairs that have 𝑛 as the first component. However, the converse statement is not true. For instance, assume that 𝑉𝑛 is finite for every 𝑛. Then every 𝑉𝑛 is enumerable, but at the same time it may happen that the set {⟨𝑛, 𝑥⟩ | 𝑥 ∈ 𝑉𝑛} is not enumerable (say, 𝑉𝑛 = {0} if 𝑛 ∈ 𝑆 and 𝑉𝑛 = ∅ otherwise, where 𝑆 is any non-enumerable set of integers). One can verify that a family is enumerable if and only if there is an algorithm that, given any 𝑛, finds a program generating 𝑉𝑛. A detailed study of enumerable sets can be found in every textbook on computability theory, for instance, in [182].
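Here is a Python sketch (ours) of the enumerator for the √2 example above; it uses exact integer arithmetic to produce the decimal digits and never halts, as the definition permits. (It may print the same 𝑛 several times; repetitions could be filtered out.)

    from math import isqrt
    from itertools import count

    def sqrt2_digits():
        # decimal digits of sqrt(2) = 1.41421356..., most significant first;
        # the first k digits, read as an integer, equal isqrt(2 * 10**(2*(k-1)))
        prev = 0
        for k in count(1):
            cur = isqrt(2 * 10 ** (2 * (k - 1)))
            yield cur - 10 * prev
            prev = cur

    def runs_of_nines():
        # yields every n such that the expansion contains a maximal block
        # of exactly n consecutive nines (surrounded by non-nines)
        run = 0
        for d in sqrt2_digits():
            if d == 9:
                run += 1
            else:
                if run > 0:
                    yield run
                run = 0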


Proof. Let us prove the theorem. First, we need to show that the set

{⟨𝑛, 𝑥⟩ | 𝑥 ∈ 𝑆𝑛} = {⟨𝑛, 𝑥⟩ | 𝐶(𝑥) < 𝑛},

where 𝑛 is a natural number and 𝑥 is a binary string, is enumerable.

Let 𝐷 be the optimal decompressor used in the definition of 𝐶. Perform in parallel the computations of 𝐷 on all inputs. (Say, for 𝑘 = 1, 2, . . . we make 𝑘 steps of 𝐷 on the first 𝑘 inputs.) If we find that 𝐷 halts on some 𝑦 and returns 𝑥, the generating algorithm outputs the pair ⟨𝑙(𝑦) + 1, 𝑥⟩. Indeed, the complexity of 𝑥 is then less than 𝑙(𝑦) + 1, as 𝑦 is a description of 𝑥. The algorithm also outputs all the pairs ⟨𝑙(𝑦) + 2, 𝑥⟩, ⟨𝑙(𝑦) + 3, 𝑥⟩, . . . , interleaved with the printing of other pairs.

For those familiar with computability theory, this proof can be compressed to one line:

𝐶(𝑥) < 𝑛 ⇔ ∃𝑦 (𝑙(𝑦) < 𝑛 ∧ 𝐷(𝑦) = 𝑥).

(The set of pairs ⟨𝑥, 𝑦⟩ such that 𝐷(𝑦) = 𝑥 is enumerable, being the graph of a computable function. The operations of intersection and projection preserve enumerability.)
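The dovetailing in this argument can be sketched as follows (Python, ours; the helpers are assumptions: run_D(y, k) simulates the optimal decompressor 𝐷 on input 𝑦 for 𝑘 steps and returns 𝐷(𝑦) if it has halted, otherwise None, and nth_string(i) is the 𝑖-th binary string in the order Λ, 0, 1, 00, . . .).

    from itertools import count

    def enumerate_pairs(run_D, nth_string):
        # yields pairs (n, x) with C(x) < n: whenever D halts on y with output x,
        # the pair (l(y) + 1, x) is announced; the pairs (l(y) + 2, x), ... also
        # belong to the set and could be interleaved in the same dovetailed way
        seen = set()
        for k in count(1):                    # stage k = 1, 2, 3, ...
            for i in range(k):                # the first k inputs
                y = nth_string(i)
                x = run_D(y, k)               # k steps of D on input y
                if x is not None and (len(y) + 1, x) not in seen:
                    seen.add((len(y) + 1, x))
                    yield (len(y) + 1, x)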

The converse implication is a bit harder. Assume that 𝑉𝑛 is an enumerable family of finite sets of strings and |𝑉𝑛| < 2^𝑛. Fix an algorithm generating the set {⟨𝑛, 𝑥⟩ | 𝑥 ∈ 𝑉𝑛}. Consider the description mode 𝐷𝑉 that deals with strings of length 𝑛 in the following way: strings of length 𝑛 are used as descriptions of strings in 𝑉𝑛. More specifically, let 𝑥𝑘 be the 𝑘th string in 𝑉𝑛, in the order in which the pairs ⟨𝑛, 𝑥⟩ appear while generating the set {⟨𝑛, 𝑥⟩ | 𝑥 ∈ 𝑉𝑛}. (We assume there are no repetitions, so 𝑥0, 𝑥1, 𝑥2, . . . are distinct.) Let 𝑦𝑘 be the 𝑘th string of length 𝑛 in the lexicographical order. Then 𝑦𝑘 is a description of 𝑥𝑘, that is, 𝐷𝑉(𝑦𝑘) = 𝑥𝑘. As |𝑉𝑛| < 2^𝑛, every string in 𝑉𝑛 gets a description of length 𝑛 with respect to 𝐷𝑉.

We need to verify that the description mode 𝐷𝑉 defined in this way is computable. To compute 𝐷𝑉(𝑦) we find the index 𝑘 of 𝑦 in the lexicographical ordering of strings of length 𝑙(𝑦). Then we run the algorithm generating pairs ⟨𝑛, 𝑥⟩ such that 𝑥 ∈ 𝑉𝑛 and wait until 𝑘 different pairs having the first component 𝑙(𝑦) appear. The second component of the last of them is 𝐷𝑉(𝑦).
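A sketch (ours) of this decompressor, assuming a generator pairs() that enumerates the set {⟨𝑛, 𝑥⟩ | 𝑥 ∈ 𝑉𝑛} without repetitions:

    def D_V(y: str, pairs):
        # y, a string of length n, describes the element of V_n that appears at
        # position index(y) while the pairs <n, x> are being generated; if V_n
        # has fewer elements, D_V(y) is undefined (the loop never returns)
        n = len(y)
        k = int(y, 2) if y else 0     # index of y among strings of length n
        seen = 0
        for m, x in pairs():
            if m == n:
                if seen == k:
                    return x
                seen += 1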

By construction, for all 𝑥 ∈ 𝑉𝑛 we have 𝐶𝐷𝑉(𝑥) ⩽ 𝑛. Comparing 𝐷𝑉 with the optimal description mode, we see that there is a constant 𝑐 such that 𝐶(𝑥) < 𝑛 + 𝑐 for all 𝑥 ∈ 𝑉𝑛. Theorem 7 is proven. □

The intuitive meaning of Theorem 7 is as follows. The assertions “the number of strings with a certain property is small” (is less than 2^𝑖) and “all the strings with a certain property are simple” (have complexity less than 𝑖) are equivalent, provided the property under consideration is enumerable, the complexity is measured up to an additive constant, and the number of elements is measured up to a multiplicative constant.

Theorem 7 can be reformulated as follows. Let 𝑓(𝑥) be a function defined on all binary strings and taking as values natural numbers and a special value +∞. We call 𝑓 upper semicomputable, or enumerable from above, if there is a computable function ⟨𝑥, 𝑘⟩ ↦→ 𝐹(𝑥, 𝑘), defined on all strings 𝑥 and all natural numbers 𝑘, such that

𝐹(𝑥, 0) ⩾ 𝐹(𝑥, 1) ⩾ 𝐹(𝑥, 2) ⩾ . . .  and  𝑓(𝑥) = lim_{𝑘→∞} 𝐹(𝑥, 𝑘)

for all 𝑥. The values of 𝐹 are natural numbers as well as the special constant +∞. The requirements imply that for every 𝑘 the value 𝐹(𝑥, 𝑘) is an upper bound for 𝑓(𝑥). This upper bound becomes more precise as 𝑘 increases, and for every 𝑥 there is a 𝑘 for which this upper bound is tight. However, we do not know the value of that 𝑘. (If there is an algorithm that, given any 𝑥, finds such a 𝑘, then the function 𝑓 is computable.) Evidently, any computable function is upper semicomputable.

A function 𝑓 is upper semicomputable if and only if the set

𝐺𝑓 = {⟨𝑥, 𝑛⟩ | 𝑓(𝑥) < 𝑛}

is enumerable. This set is sometimes called the “upper graph” of 𝑓, which explains the somewhat strange names “upper semicomputable” and “enumerable from above”.

Let us verify this. Assume that a function 𝑓 is upper semicomputable, and let 𝐹(𝑥, 𝑘) be the function from the definition of semicomputability. Then we have

𝑓(𝑥) < 𝑛 ⇔ ∃𝑘 𝐹(𝑥, 𝑘) < 𝑛.

Thus, performing in parallel the computations of 𝐹(𝑥, 𝑘) for all 𝑥 and 𝑘, we can generate all the pairs in the upper graph of 𝑓.

Assume now that the set 𝐺𝑓 is enumerable. Fix an algorithm enumerating this set. Then define 𝐹(𝑥, 𝑘) as the best upper bound on 𝑓 obtained after 𝑘 steps of generating elements of 𝐺𝑓. That is, 𝐹(𝑥, 𝑘) is equal to the minimal 𝑛 such that the pair ⟨𝑥, 𝑛 + 1⟩ has been printed after 𝑘 steps. If there is no such pair, let 𝐹(𝑥, 𝑘) = +∞.
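In code (a Python sketch, ours), assuming a helper printed_pairs(k) that returns the pairs ⟨𝑥, 𝑛⟩ printed during the first 𝑘 steps of the fixed enumeration of 𝐺𝑓:

    import math

    def F(x, k, printed_pairs):
        # best upper bound on f(x) found within k steps: the minimal n such
        # that the pair <x, n + 1> has been printed, i.e., such that f(x) < n + 1
        bounds = [n - 1 for (y, n) in printed_pairs(k) if y == x]
        return min(bounds) if bounds else math.inf

The values F(x, 0), F(x, 1), . . . are non-increasing (more pairs only improve the bound) and converge to 𝑓(𝑥), as the definition requires.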

Using the notion of an upper semicomputable function, we can reformulate Theorem 7 as follows.

Theorem 8. (a) The function 𝐶 is upper semicomputable and

|{𝑥 | 𝐶(𝑥) < 𝑛}| < 2^𝑛

for all 𝑛.
(b) If a function 𝐶′ is upper semicomputable and |{𝑥 | 𝐶′(𝑥) < 𝑛}| < 2^𝑛 for all 𝑛, then 𝐶(𝑥) ⩽ 𝐶′(𝑥) + 𝑐 for some 𝑐 and for all 𝑥.

Note that the upper bound 2^𝑛 on the cardinality of the set {𝑥 | 𝐶′(𝑥) < 𝑛} in item (b) can be replaced by the weaker upper bound 𝑂(2^𝑛).

Theorem 8 allows us to define Kolmogorov complexity as the minimal (up to an additive constant) upper semicomputable function 𝑘 that satisfies the inequality

|{𝑥 | 𝑘(𝑥) < 𝑛}| = 𝑂(2^𝑛).

One can replace the requirement of minimality in this definition by some other properties of 𝐶. In this way we obtain the following “axiomatic” definition of Kolmogorov complexity [172]:

Theorem 9. Let 𝑘 be a function defined on binary strings and taking natural values. Assume that 𝑘 satisfies the following properties:
(a) 𝑘 is upper semicomputable; [enumerability axiom]
(b) for every partial computable function 𝐴 from Ξ to Ξ the inequality

𝑘(𝐴(𝑥)) ⩽ 𝑘(𝑥) + 𝑐

is valid for some 𝑐 and all 𝑥 in the domain of 𝐴; [complexity non-increase axiom]
(c) the number of strings 𝑥 such that 𝑘(𝑥) < 𝑛 is in the range [2^(𝑛−𝑐1); 2^(𝑛+𝑐2)] for some 𝑐1, 𝑐2 and for any 𝑛. [calibration axiom]


Then 𝑘(𝑥) = 𝐶(𝑥) + 𝑂(1), that is, the difference |𝑘(𝑥) − 𝐶(𝑥)| is bounded by a constant.

Proof. Theorem 8 implies that 𝐶(𝑥) ⩽ 𝑘(𝑥) + 𝑂(1). So we need to prove that

𝑘(𝑥) ⩽ 𝐶(𝑥) + 𝑂(1).

Lemma 1. There is a constant 𝑐 and a computable sequence of finite sets of binary strings

𝑀0 ⊂ 𝑀1 ⊂ 𝑀2 ⊂ . . .

with the following properties: the set 𝑀𝑖 has exactly 2^𝑖 strings, and 𝑘(𝑥) ⩽ 𝑖 + 𝑐 for all 𝑥 ∈ 𝑀𝑖 and all 𝑖.

Computability of 𝑀0, 𝑀1, 𝑀2, . . . means that there is an algorithm that, given any 𝑖, computes the list of elements of 𝑀𝑖.

Proof. By axiom (c) there exists a constant 𝑐 such that for all 𝑖 the set

𝐴𝑖 = {𝑥 | 𝑘(𝑥) < 𝑖 + 𝑐}

has at least 2^𝑖 elements. By item (a) the family 𝐴𝑖 is enumerable. Remove from 𝐴𝑖 all the elements except the 2^𝑖 strings generated first, and let 𝐵𝑖 denote the resulting set. The list of the elements of 𝐵𝑖 can be found given 𝑖: we wait until the first 2^𝑖 strings are generated. The set 𝐵𝑖 is not necessarily included in 𝐵𝑖+1. To fix this we define 𝑀𝑖 inductively: we let 𝑀0 = 𝐵0, and we let 𝑀𝑖+1 be equal to 𝑀𝑖 plus any 2^𝑖 elements of 𝐵𝑖+1 that are outside 𝑀𝑖. Lemma 1 is proven.
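The inductive construction of the sets 𝑀𝑖 can be sketched as follows (Python, ours), assuming a generator enumerate_A(i) that yields the elements of 𝐴𝑖 = {𝑥 | 𝑘(𝑥) < 𝑖 + 𝑐} in the order they are enumerated:

    def build_M(enumerate_A, i_max):
        # returns the list M_0 ⊂ M_1 ⊂ ... ⊂ M_{i_max}; |M_i| = 2**i, and every
        # element of M_i belongs to some B_j (the first 2**j strings of A_j)
        # with j <= i, hence k(x) <= i + c on M_i
        M, prev = [], set()
        for i in range(i_max + 1):
            B_i = []
            for x in enumerate_A(i):          # wait for the first 2**i strings of A_i
                B_i.append(x)
                if len(B_i) == 2 ** i:
                    break
            if i == 0:
                cur = set(B_i)                              # M_0 = B_0
            else:
                fresh = [x for x in B_i if x not in prev]   # at least 2**(i-1) such strings
                cur = prev | set(fresh[:2 ** (i - 1)])      # add exactly 2**(i-1) new ones
            M.append(cur)
            prev = cur
        return M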

Lemma 2. There is a constant 𝑐 such that 𝑘(𝑥) ⩽ 𝑙(𝑥) + 𝑐 for all 𝑥 (recall that 𝑙(𝑥) denotes the length of 𝑥).

    Proof. Let 𝑀0,𝑀1,𝑀2, . . . be th