Quality of Play in Chess and Methods for Measuring

  • 8/18/2019 Quality of Play in Chess and Methods for Measuring

    Quality of play in chess and methods for measuring

    Erik Varend 

    Tallinn, 2014

    Abstract. This study uses computer analysis to examine the absolute strength of play of various chess-playing entities (humans and computers). First, the actual accuracy of play is determined, measured as the mean difference between the move suggested by the engine and the move actually made. Then, individually for each entity, the factors that affect the accuracy of play are determined, and an estimated accuracy of play is derived from those factors. It shows what the accuracy would be if all factors were the same for all players. As a result, it was determined that there is a relationship between rating and the quality of play. It was also found that the further one goes back in time, the more the quality of play decreases. By comparing the accuracy of play of humans and engines, it was determined to what extent CCRL and FIDE ratings correlate. The author also drew several miscellaneous conclusions based on the collected data.

    1. Introduction

    The primary aim of this study is to find a correlation between the strength of play (expressed as either FIDE or CCRL rating) and the accuracy of play. In addition, the 7 most noteworthy performances in the history of chess are compared, and it is examined how the strength of chess players has changed over time. The final section of the paper contains various other conclusions that can be drawn from the data collected.

    There are different ways to estimate and compare performances:

    1. by measuring absolute strength
    2. by measuring relative strength.

    Absolute strength can be defined by how far a performance is from the perfect performance, i.e. the distance between the actual performance and the best performance possible. The closer a performance stands to absolute perfection, the better it is. In the case of relative strength, a performance is compared to the results of other performers, and the actual strength has no importance at all. In circumstances where ascertaining the absolute strength has not been easily feasible, there has normally been no choice but to use relative strength measurement as a yardstick. The latter is prevalent in one-on-one sports such as snooker, tennis and chess, where the ELO rating is used to compare the strength of players. Comparison of players and performances from different epochs is only possible using absolute strength. It is, for example, impossible to say who was stronger, Lasker or Spassky, using their chessmetrics ratings.

    Consequently, we need to find an indicator of absolute strength in chess. There is a variety of ways to do this, which can be split into two primary types:

    1. tablebases
    2. various computer-based estimations

    Certainly, tablebases would be the most preferred option, because they give perfect solutions for each position. In that case the accuracy of play would be measured as the mean number of transitions per move. A transition is a change in the state of the game (won, drawn, or lost), assuming perfect play from both sides. A change from a won position to a drawn one, or from a drawn one to a lost one, equals 1 transition. If a won position becomes a lost position, it counts as 2 transitions. The fewer transitions per move, the higher the quality of play. Four-piece tablebases were completed by the end of the 1980s, 5-piece tablebases were compiled in the early nineties, those with 6 pieces in 2005, and now we have 7-piece tablebases. This implies that it is quite hopeless to expect complete 32-piece tablebases in the near future. That is the reason why chess engines are necessary. There are many ways to describe the absolute accuracy with the help of chess engines:

    1. The average difference between the best move suggested by the engine and the move actually made
       1.1 difference expressed in centipawns
       1.2 difference expressed in percentages
    2. The average change in evaluation after the move made by the player
    3. The percentage of moves that coincide with those suggested by the engine
    4. The percentage of moves where the error exceeds a predetermined threshold

    Version 1.1 can be called the classical method, since it was used by the Slovenian researchers I. Bratko and M. Guid in their groundbreaking study [1]. The magnitude of an error is essentially the centipawn gap between the evaluation of the move suggested by the engine and that of the move actually made. Smaller differences indicate more accurate play.
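    The classical method described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `average_error`, the input layout (pairs of evaluations in pawns, from the mover's point of view) and the clamping of negative gaps to zero are choices made here, not details taken from the study.

```python
from statistics import mean


def average_error(moves):
    """Mean gap between the engine's best move and the move actually played.

    `moves` is an iterable of (best_eval, played_eval) pairs in pawns,
    both seen from the side to move (an assumption of this sketch).
    """
    # A played move cannot be better than the engine's best move at the same
    # depth, so negative gaps are clamped to zero (assumption of this sketch).
    gaps = [max(0.0, best - played) for best, played in moves]
    return mean(gaps)


# Hypothetical sample: three moves with gaps of 0.29, 0.00 and 0.25 pawns.
sample = [(0.34, 0.05), (0.10, 0.10), (-0.20, -0.45)]
print(round(average_error(sample), 3))
```

    A smaller return value corresponds to more accurate play in the sense used in the text.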

    Another promising possibility is to use percentages instead of centipawns, similarly to the Monte-Carlo method. The percentage indicates White's score against Black after a move. To find the score, the computer is set to play a certain number of games against itself. Better scores represent moves that are preferable. The downside is that it takes a lot of time to get a statistically valid number of games, especially taking into account the need to ensure that the engine has enough time per move. Otherwise its usefulness in more complicated positions becomes questionable due to the horizon effect. Its advantage lies primarily in theoretically drawn endgame positions, where evaluation-based estimations are known to be unreliable.

    The Portuguese scientist D. R. Ferreira has worked out an interesting alternative solution, where what matters is not the gap to the best move in the same position, but the gap between the evaluations of the best moves before and after a move has been made by a player. Like the classical method, Ferreira's method can be used with percentages.

    The tables below display the differences between the classical and Ferreira methods in the cases of centipawns and percentages.

    move      evaluation   gap     | move after Ne5   evaluation   change
    13.Ne5    0.34                 | 13...h6          0.01         -0.33
    13.Rc1    0.31                 | 13...Bc4         0.04
    13.e4     0.05         0.29    | 13...dxc4        1.03
    13.Nc2    -0.12                | 13...f5          1.34
    13.a3     -0.14                | 13...Bxe5        1.80

    move      percentage   gap     | move after Rc1   percentage   change
    13.Rc1    55%                  | 13...h6          52%          -3%
    13.Ne5    50%                  | 13...Bc4         55%
    13.e4     50%          5%      | 13...dxc4        93%
    13.Nc2    40%                  | 13...Bxe5        97%
    13.a3     39%                  | 13...f5          99%
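    The two measures compared in the tables can be written side by side. The following is a minimal sketch, not Ferreira's or Bratko and Guid's actual code: the function names are made up for illustration, and the numbers are the centipawn values as read from the first table (best move 13.Ne5 at 0.34, played move 13.e4 at 0.05, best reply after 13.Ne5 at 0.01).

```python
def classical_gap(best_eval, played_eval):
    """Classical method: gap between the best move and the move played,
    both evaluated in the same position."""
    return best_eval - played_eval


def ferreira_change(best_eval_before, best_eval_after):
    """Ferreira-style measure: change between the evaluation of the best move
    before the player's move and the best move available after it."""
    return best_eval_after - best_eval_before


# Classical: 13.e4 (0.05) compared with the engine's choice 13.Ne5 (0.34).
print(round(classical_gap(0.34, 0.05), 2))

# Ferreira: after 13.Ne5 the best reply 13...h6 is evaluated at 0.01.
print(round(ferreira_change(0.34, 0.01), 2))
```

    The first value is a gap within one position, while the second compares two consecutive positions, which is why the two methods can disagree about the same move.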

    Ways 3 and 4 are clearly inferior. The fact that a move made by a player coincides with the one suggested by the computer may in most cases indeed indicate a good move.

    impact as one error of 0.80. The principal problem of both methods lies in the fact that they are too coarse and do not describe the position of moves on the quality spectrum.

    However, determining the absolute accuracy of play alone is not sufficient. A performance never happens in a vacuum, isolated from all factors acting upon it. The level of a performance can only be manifested by the co-influence of two factors:

    1. potential
    2. conditions

    Potential is the ability of a player to exhibit as high a standard of performance as possible. It depends on a variety of characteristics that differ for each sport. For instance, physical sports require good physique, stamina and technical skills. In mental sports, such as chess and go, the required characteristics include short-term memory, calculation speed, intuition etc.

    Conditions refers to a set of factors upon which the accuracy of play depends:

    1. difficulty of positions
    2. thinking time
    3. practical play
    4. psychology
    5. conditions in the venue
    6. health
    7. level of fatigue

    The first three are the most important.

    In some positions it is easier to find a good move, whereas in other positions it is more difficult. That is what the term 'difficulty of positions' refers to. There are many ways in which a position can be difficult, and it cannot be described by a single factor alone. It consists of many aspects: for example, the position may be chaotic and complicated, there may be relatively few good moves, or the good moves may appear illogical at first sight. Difficulty is also individual and varies among players: what is difficult for one player may be easier for another. Computers are generally able to find illogical moves with greater certainty than humans.

    Thinking time is simply the time control that games are played under. Over time, the rate of play has gradually become faster.

    The notion 'practical play' refers to the phenomenon where a player intentionally sacrifices accuracy of play to make matters more difficult for the opponent. The goal is to create a situation where the opponent has to make comparatively more effort to maintain the same level of accuracy. There are 3 kinds of situations that can be pursued in practical play:

    1. difficulty of positions
    2. suitability of the type of positions
    3. thinking time.

    Suitability of the type of positions indicates how well a certain type of position suits a player and his nature: whether he is familiar with such positions, and whether a given position calls more for calculation, knowledge, intuition etc.

    In the starting position, and usually in the opening phase of the game, all three factors are even for both players. The aim is to introduce imbalances into the game in one's own favour, so that the opponent gets positions that are more difficult and also less suitable for him. If the opponent is in time trouble, one can move faster so that he has less time to ponder.

    Psychology plays an important role in chess. A chess player must be willing to endure competitive stress, and it is important that he is able to remain calm in critical moments. Sometimes a chess player allows himself to be disturbed by psychological factors, such as problems in private life or concerns over homeland, relatives and friends, which can affect concentration on the game and hinder going all out. The third type of psychological factor is directly connected to chess: incompatibility with the style of an opponent, fear, or a feeling of uneasiness with him. Probably the most famous example is Shirov's lifetime score against Kasparov, 7:22 with no wins, which is far more lopsided than one could expect from their ratings. Also, one may have gotten so used to the style of a particular chess player that unexpected sudden changes in his play may confuse. Among these are cases where a player who has usually preferred correct and objective play suddenly sacrifices material. Believe him or not?

    Consequently, psychological factors can be broken into three main types:

    1. factors arising from the player's characteristics

    today#s rating chessmetrics rating 654 corresponds to.

    To find a correlation between modern rating and the accuracy of play, @ cohorts at each *44 elo were analy;ed in the

    range *@44/+44. In each cohort the rating range was J$/5 $K5L, where $ signifies the goal rating of a particular cohort. The lowest number of moves was 844.

    And to find out how strongly engines play, 5 different chess engines from %%&' 84B84 rating list7. Ta!en into

    consideration were at least 54 moves by the following engines- iarcs *.* (@*), %rafty 7.4 (674), >hilou .3.4(76+), , with 65 seconds per move. The chess interfacewas Arena . The hardware was Intel i+ 364 M .34 :h;.

    Only moves made in more-or-less even positions should generally be considered. If the move suggested by the engine and the move actually made on the board were both evaluated outside the range [-2.00, 2.00] and with the same sign, then the position was considered decisive, and the moves were discarded.
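    The filter for decisive positions can be expressed as a small predicate. This is a sketch of the rule as stated in the text; the function name `is_decisive` and the `limit` parameter are illustrative names, and treating exactly 2.00 as still inside the range is an assumption.

```python
def is_decisive(best_eval, played_eval, limit=2.00):
    """Return True when a position counts as already decided.

    Both the engine's move and the played move must be evaluated outside
    [-limit, limit], and both evaluations must have the same sign.
    """
    both_outside = abs(best_eval) > limit and abs(played_eval) > limit
    same_sign = (best_eval > 0) == (played_eval > 0)
    return both_outside and same_sign


positions = [(2.50, 3.10), (2.50, -2.40), (1.50, 2.60), (-3.00, -2.20)]
kept = [p for p in positions if not is_decisive(*p)]
print(kept)
```

    Note that a blunder that turns a winning evaluation into a losing one is kept, because the two evaluations then have different signs.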

    As a novelty, moves that are very obvious were left out of consideration. A move is considered too easy to spot if it meets the two criteria below at every depth, starting from the first ply:

    • the move suggested by the engine remains the same
    • the gap between the two best moves is always 1.00 or larger.
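    The two criteria for an obvious move can be checked together once the engine's per-depth output has been collected. A minimal sketch, assuming that `best_moves` and `gaps` are lists with one entry per search depth (hypothetical input layout, not specified in the text):

```python
def is_obvious(best_moves, gaps, min_gap=1.00):
    """Return True when a move is too easy to spot.

    best_moves: engine's first choice at each depth, starting from ply 1.
    gaps: evaluation gap between the two best moves at each depth, in pawns.
    """
    same_move_throughout = len(set(best_moves)) == 1
    gap_always_large = all(gap >= min_gap for gap in gaps)
    return same_move_throughout and gap_always_large


# Hypothetical recapture that the engine prefers by a wide margin at every depth.
print(is_obvious(["Qxd5", "Qxd5", "Qxd5"], [3.10, 2.95, 3.02]))
# The first choice changes at depth 3, so the move is not treated as obvious.
print(is_obvious(["Qxd5", "Qxd5", "Nf3"], [3.10, 2.95, 1.20]))
```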

    One must bear in mind that there is a boundary above which the magnitude of errors is irrelevant. For example, if a player makes one move after which the evaluation drops from 1.3 to -3.04, and another move after which the evaluation drops from 0.89 to -11.03, then there is no basis for assuming that the former is objectively better than the latter.

    Each blue dot represents a position. The red line shows the linear relationship between evaluation and error, whose steepness is called the slope. Unlike the factors of difficulty, the relationship between accuracy and evaluation does not depend on the player's nature of play. Therefore, if the evaluation were the same for all players, the expected error would have to be derived according to the same formula. But here a new problem arises: the average error varies among players, affecting the slope of the linear relationship. A slope indicates the degree of error change in relation to evaluation changes. To find the relationship between the slope and the average error, 10 randomly picked selections with 1500 positions in each were taken. The graph below shows the result that can be taken as a basis.

    For example, if a player has an average error of 0.120, then, according to the formula, his slope would be 1.51*0.12+0.02 = 0.20. Increasing the average evaluation by 0.1 would cause the player's average error to be inflated by 0.2*0.1 = 0.02.
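    The worked example above can be reproduced with two small functions. This sketch assumes the trend line of Graph 2, slope = 1.51 * average error + 0.02, and an average error of 0.12. The function and parameter names are illustrative only.

```python
def eval_slope(avg_error, a=1.51, b=0.02):
    """Slope of the error-vs-evaluation line for a player with the given
    average error, using the linear trend from Graph 2 (coefficients a, b)."""
    return a * avg_error + b


def error_inflation(avg_error, eval_increase):
    """Expected growth of the average error when the average absolute
    evaluation of a player's positions rises by `eval_increase` pawns."""
    return eval_slope(avg_error) * eval_increase


print(round(eval_slope(0.12), 2))            # slope for an average error of 0.12
print(round(error_inflation(0.12, 0.1), 2))  # inflation for +0.1 average evaluation
```

    Subtracting this inflation from a player's measured error is one way to compare players as if their positions had the same average evaluation, which is the purpose the text gives for the slope.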

    2.2 Difficulty of positions

    This research uses two different factors of difficulty:

    • the difference between the best and the second best moves, expressed in centipawn units
    • complexity

    The first one is self-explanatory. The latter needs some explaining. The manner of calculating complexity is taken from the work of Bratko and Guid. Every time the engine proposes a new #1 move, the gap between the best and the second best moves is recorded, and at the end all these gaps are summed together. Here it is presented in the form of the original program code.

    complexity := 0
    FOR (depth 2 to 12) {
        IF (depth > 2) {
            IF (previous_best_move NOT EQUAL current_best_move) {
                complexity += |best_move_evaluation - second_best_move_evaluation|
            }
        }
        previous_best_move := current_best_move
    }

    [Graph 1: evaluation vs average error (x-axis: evaluation; y-axis: error; linear trend line)]

    [Graph 2: eval vs avg error slope depending on the average error (x-axis: avg error; y-axis: eval slope; trend line f(x) = 1,51x + 0,02)]

    In this study a modified version is used. At depths of 10-15 plies all values are doubled to assign them more importance: it is always harder to see changes at greater depths, and such changes indicate a more complicated position.

    Below is a comparative example of how complexity scores are computed in both ways. Highlighted in yellow are the cases where the best move changes. The sum in the bottom row shows the degree of complexity.

    depth   best move   difference (original)   difference (modified)
    2       Nc6         0.00                    0.00
    3       Nf6         0.03                    0.03
    4       Nf6         0.00                    0.00
    5       d6          0.16                    0.16
    6       Nc6         0.00                    0.00
    7       Nc6         0.00                    0.00
    8       Nf6         0.02                    0.02
    9       Nf6         0.00                    0.00
    10      Nf6         0.00                    0.00
    11      Nf6         0.00                    0.00
    12      Nf6         0.00                    0.00
    13      e5          0.17                    0.17 (x2)
    14      e5          0.04                    0.04
    15      e5          0.14                    0.14
    sum                 0.38                    0.55
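    Both ways of computing complexity can be sketched in Python as follows. This is an illustrative re-implementation, not the original program: the function signature, the `double_from` parameter and the input lists are assumptions, and the data are the best moves and best-vs-second-best gaps for depths 2 to 15 as read from the comparative example.

```python
def complexity(best_moves, gaps, first_depth=2, double_from=None):
    """Sum the best-vs-second-best gaps at every depth where the engine's
    first choice changes.

    best_moves / gaps: one entry per depth, starting at `first_depth`.
    double_from: if given, gaps at this depth and deeper count twice
    (the modified version described in the text uses depth 10).
    """
    total = 0.0
    previous = None
    for depth, (move, gap) in enumerate(zip(best_moves, gaps), start=first_depth):
        if depth > first_depth and move != previous:
            weight = 2 if double_from is not None and depth >= double_from else 1
            total += weight * abs(gap)
        previous = move
    return total


moves = ["Nc6", "Nf6", "Nf6", "d6", "Nc6", "Nc6", "Nf6",
         "Nf6", "Nf6", "Nf6", "Nf6", "e5", "e5", "e5"]
gaps = [0.00, 0.03, 0.00, 0.16, 0.00, 0.00, 0.02,
        0.00, 0.00, 0.00, 0.00, 0.17, 0.04, 0.14]

print(round(complexity(moves, gaps), 2))                  # original method
print(round(complexity(moves, gaps, double_from=10), 2))  # modified method
```

    The only change of best move at depth 10 or deeper happens at depth 13, so the modified score exceeds the original one by exactly that gap.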

    The two graphs below illustrate how both factors of difficulty influence the accuracy of play. All analyzed positions are included.

    The influence of a factor of difficulty on a player is individual and depends on one's nature of play. Some players are relatively more susceptible to changing difficulty of positions: their accuracy deteriorates faster than that of other players as difficulty increases. As the difficulty of positions cannot be described by one parameter only, it remains possible that different factors have a different effect on a player. For example, of two equally strong players, one may have a lower than average tolerance for the factor represented by 'complexity' in this study and a higher than average tolerance for 'difference', while for the other it is completely the other way around.

    2.3 Thinking time

    There is plenty of information on the Internet about the various time controls that have been used in various events throughout history. Unfortunately, it is not always possible to find this information for a particular event. In such cases the following principle was applied:

    1880 - 1925    4 min per move
    1926 - 1945    3 min 20 s per move
    1946 - 1985    3 min 45 s per move
    1986 - ...     3 min per move
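    The fallback rule above amounts to a lookup by year. A minimal sketch, assuming the period boundaries 1880-1925, 1926-1945, 1946-1985 and 1986 onwards. The function name and the choice to raise an error for earlier years are illustrative.

```python
def default_seconds_per_move(year):
    """Fallback thinking time per move when the time control of an event
    is unknown, following the periods listed in the text."""
    if year < 1880:
        raise ValueError("no fallback is defined before 1880")
    if year <= 1925:
        return 4 * 60          # 4 min
    if year <= 1945:
        return 3 * 60 + 20     # 3 min 20 s
    if year <= 1985:
        return 3 * 60 + 45     # 3 min 45 s
    return 3 * 60              # 3 min


for year in (1895, 1936, 1972, 2008):
    print(year, default_seconds_per_move(year))
```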

    D.

    K *4##Bmove / 744 elo.8 According to that, the double difference in thin!ing time is eual to ** elo, and the relationship

     between them is logarithmic. For engines the difference is worth 66 elo.

    The biggest concern in games of earlier times is adjourned games. There is no doubt that the possibility to analyze games, either alone or with assistants, greatly helps the accuracy of moves played after resuming the game. It would be necessary to know how long those sessions lasted before play resumed, and whether analyzing was allowed. As in the case of time controls, information is rather scarce. In the absence of reliable information, 1 hour was added as compensation to the time control of each game that underwent adjournment.

    Sometimes the time control information specifies only the number of minutes for the phase after the 40th/60th move, not the number of remaining moves. In such cases the remaining amount of time was divided by the number of moves actually played. If the result exceeds the time per move specified in the first part of the time control, then the average thinking time in the given phase of the game is considered the same as in the preceding phase.
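    The rule for a phase with an unspecified number of moves can be sketched as a capped division. The names below are hypothetical, and treating a phase with no moves played as inheriting the preceding rate is an assumption of this sketch.

```python
def phase_seconds_per_move(remaining_seconds, moves_played, previous_rate):
    """Average thinking time per move in a later phase of the time control.

    remaining_seconds: time allotted to the phase.
    moves_played: number of moves actually played in that phase.
    previous_rate: seconds per move in the preceding phase.
    """
    if moves_played <= 0:
        return previous_rate
    rate = remaining_seconds / moves_played
    # If the computed rate exceeds the earlier rate, the earlier rate is used.
    return min(rate, previous_rate)


# One extra hour, 10 moves played, earlier rate 180 s/move: capped at 180.
print(phase_seconds_per_move(3600, 10, 180))
# One extra hour, 30 moves played: 120 s/move.
print(phase_seconds_per_move(3600, 30, 180))
```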

    2.4 Practical play

    Of the three possible manifestations of practical play, only the difficulty of positions is looked at here. Ideally, it would have been preferable to use the suitability of position types and thinking time as well. But in the first case it would have been necessary to devise a way to quantify the suitability of the types of positions for players, and in the latter case knowledge of the precise amount of time spent thinking on each move would have been needed. Both are unrealizable at the current juncture. The method in itself is simple: measure and compare the difficulty of positions for either side of the board. If one side has positions that are easier to play, it may be assumed that its results are better than its accuracy of play would suggest. The effect of the difference in difficulty between the players depends on two factors:

    a) degree of the difference between the difficulty of positions
    b) sensitivity of a player's accuracy of play to difficulty

    Generally there is no data available on the tolerance of particular players with respect to changing difficulty level. In such cases it is possible to use a generalized sensitivity to both factors of difficulty of positions, which is dependent on the average error. In order to find this, we first take the average rating of all opponents and look up its equivalent average expected error in the error-rating table. Secondly, we determine the relationship between average error and slopes for both types of difficulty, as shown in the graphs below. Similarly to graph 2, each data point represents a randomly selected dataset of 1500 positions.

    8 http-BBwww.chessgames.comBperlBchess.plEtidP34@34H!pageP4Qreply575

    [Graph 6: complexity slope depending on avg error (x-axis: avg error; y-axis: slope)]

    [Graph 7: difference slope depending on avg error (x-axis: avg error; y-axis: slope)]

    [Graph: Dependence of performance on thinking time (x-axis: time coefficient; y-axis: elo performance)]
    1lope increases with average error. ence, if our opponent had an average e$pected error of 4.45, its comple$ity vs

    average error slope would be 4.6*O4.454.@7P4.*8$ and difference vs average error slope 4.8Oln(4.45)[email protected]*$.It#s not necessary to include practical play if both sides of games are ta!en into analysis. In that case differences in

    difficulty, suitability, thinking time etc. would cancel each other out. If a game is on average more difficult for one player by x hypothetical units, then it is at the same time easier for his opponent by x units, and the sum is always zero. For this reason, practical play has only been included in the analysis of the games of the 7 most remarkable performances in the history of chess. For the rest of the games, both sides have been taken into account.

    2.5 Finding the strength of play

    Having determined the absolute accuracy of play and the aforementioned factors that affect it, it becomes possible to derive the expected error of players. This consists of the following steps:

    1. Find the average expected error of players.
    2. Establish a relation between a modern rating and the expected error.
    3. Find the modern rating equivalent of the expected error.
    4. Find out rating losses/gains due to time control and practical play.

    As a result we get a supposed present-day rating corresponding to the strength of play. Unfortunately, one must accept that full confidence can never be attained. The methods described here are by no means 100% reliable, as the approach is still in its infancy and the chess engines of today have limited abilities.

    The expected error indicates a player's hypothetical accuracy of play (average error) if the difficulty of positions and the evaluation were exactly the same for all players. In this study the average complexity of all moves valid for comparison (0.58) and the average difference (0.53) were used to represent a common ground. The graph below, showing how the accuracy of Capablanca and Kramnik changes as a function of complexity, also depicts the manner in which the expected error is determined with the help of linear trend lines.

    As we can see, Capablanca's expected error by complexity is 0.070, and that of Kramnik is 0.084. Kramnik's positions were a little more complicated than average, and the complexity of Capablanca's positions was far lower than the average complexity of all positions, therefore the gap between their accuracies of play would be smaller if they both had positions of the same complexity. The expected error according to the difference is found by the same method. One can also note that Capablanca's accuracy of play has been less dependent on difficulty than Kramnik's.
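    The trend-line step can be illustrated with an ordinary least-squares fit. This is a sketch under assumptions, not the author's procedure in detail: it fits a straight line to one player's (complexity, error) pairs and reads off the error at a common complexity value. The data points and the common value of 0.5 below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expected_error(complexities, errors, common_complexity):
    """Error a player would be expected to show if his positions had the
    given common complexity, according to his own linear trend line."""
    slope, intercept = fit_line(complexities, errors)
    return slope * common_complexity + intercept


# Hypothetical player whose error grows by 0.1 per unit of complexity.
xs = [0.2, 0.4, 0.6, 0.8]
ys = [0.04, 0.06, 0.08, 0.10]
print(round(expected_error(xs, ys, 0.5), 3))
```

    Evaluating every player's trend line at the same complexity is what puts players with easy and with difficult positions on a common footing.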

    [Graph 8: Finding expected error (x-axis: complexity; y-axis: error; series: Capablanca New York 1927, Kramnik vs Kasparov match 2000, each with a linear trend line; annotations: average complexity of Capablanca's positions, average complexity of Kramnik's positions, average complexity of all positions, actual average error of Capablanca's moves, actual average error of Kramnik's moves, expected error of both players' moves, change in each player's accuracy due to changing difficulty)]
    3. Results

    The following section is divided into two parts. First, all necessary data on all analyzed chess-playing entities is dealt with, and then, step by step on that basis, the hypothetical strength of play of each player is found.

    The most important of these data is, of course, the actual accuracy of play, i.e. the average error. The result of a game depends only on differences in the accuracy of play. However, it must be borne in mind that the average error never directly shows the level of chess skill, but rather remains biased towards players with a more positional style and longer time controls. The following graph displays all chess-playing entities sorted by average error.

    As expected, most engines occupy the top spots.

    It stands out that Capablanca had positions with by far the least difficulty. Much has been said about Fischer's simple style of play and his tendency to avoid complications. Indeed, according to graph 11, the complexity of his positions was below average in the games against Larsen and Taimanov. However, it can be seen in graph 10 that the average difference between the two best moves in Fischer's positions is above average. The fact that a position seems somewhat easy to us does not automatically mean that it is easy to find accurate moves. It is perhaps not surprising that correspondence games from chessgames.com have the lowest average evaluation, i.e. in those games the positions stayed equal longer due to the higher quality of play.

    [Graph 12: Average evaluation of positions (x-axis: evaluation; one bar per analyzed entity)]

    [Graph 13: Average expected error (x-axis: average expected error; one bar per analyzed entity)]

    The graph above shows the average expected error, which is derived by taking the average of the expected errors by complexity and by difference, and includes changes in the average error due to the evaluation. The results are more logical than those displayed in graph 9. Correspondence games are left out, as there is no point in measuring changes in the accuracy of play if its estimation cannot be trusted. As a rule in all kinds of measurements, the gauge must be of higher quality or trustworthiness than the things being measured. The methods used in this paper are simply not adequate for modern software-assisted correspondence games.

    The next step is to take the data from the previous graph to find the relationship between the rating and the quality of play.

    [Graph 14: The relationship between accuracy and FIDE rating 2008 (x-axis: FIDE rating; y-axis: average expected error; logarithmic trend line; boundary of trustability = 0,031)]

    The relationship appears to be logarithmic. The black line depicts the approximate boundary of trustability, below which engine output cannot be trusted. It is interesting to note that it crosses the trend line at 2931 ELO, which may indicate that the level of play of the combination of engine, hardware and time used here is equal to 2931 FIDE 2008. But that is naturally speculation which needs further research.

    Players ranked according to thinking time:

    [Graph 15: Thinking time (x-axis: time control; one bar per analyzed entity)]

    The farther back in time, the longer the time controls are.

    The next steps represent an attempt to factor in at least a fraction of the generally unfathomable and messy notion called practical play.

    Players ranked according to the relative difference of positions:

    A negative value shows that the opponents had easier positions; a positive value shows the opposite. As one could have expected, Kasparov and Lasker, players known for practical play, are situated on top. Somewhat surprisingly, it appears that even Capablanca had both difficulty factors easier than his opponents. One of the reasons could be that the easier one's positions, the greater the probability that the opponent's positions are more difficult, regardless of the degree of practicality in one's play. Kramnik's positions were, as expected, easier than Kasparov's in the title match of 2000.

    Before trying to find out how much exactly a difficulty differential influences the opponent's play, it is necessary to know the strength of the opponents. First we look up the FIDE or chessmetrics rating of that time and translate it into its contemporary rating equivalent.

    The blue line represents actual data based on the analysis of randomly picked games (rating range 2600/2700) from each decade. The red line represents the strength of play of top-rated players. The gap in each decade between a top-rated player and a 2650-rated player is based on the arithmetical averages of the January lists in the same decade. It can be seen that, if the logarithmic trend line can be trusted, top players first reached FM level (2300 ELO) already in the mid-19th century. The level of an International Master (2400/2500) was achieved in the 1880/1890s. GM level was reached during the first decades of the XX century. Development was relatively quick-paced at that time. Top players were on today's Super GM level already in the 40-ies.

    [Graph: Relative difficulty of positions (complexity and difference) for selected performances; scale from -0,100 to 0,100.]

    Graph 17: evolution of chess strength by decades
    [Chart "Rise of chess skills over time": x axis decades (1830s to 2030s), y axis FIDE 2008 (1800 to 3000); series "elo chessmetrics 2650" and "elo chessmetrics highest", each with a logarithmic trend line.]


    The graph below demonstrates the opponent ratings and their actual strengths.

    Not surprisingly, Kramnik had the strongest opponent, against Kasparov in 2000. The weakest opposition was against Capablanca in New York 1924. By translating those ratings into average expected error and taking into account how the generalized sensitivity to either type of difficulty, as described in section 2.4, depends on it, it can be ascertained how much difficulty differentials affect performance.

    The next graph shows the final conclusion of this work. The blue bars indicate ratings directly derived from the expected error, as shown on the graph 14.

    [Graph: ratings for selected performances, among them Karpov Linares 1994, Kasparov Linares 1999, Carlsen Nanjing 2009 and Kramnik vs Kasparov 2000; rating axis from 2200 to 3000.]


    The winners according to this criterion are Hiarcs 12.1 and Carlsen in Nanjing 2009. It may seem surprising that his actual accuracy of play in that tournament was rated almost 100 points lower than his official TPR (3001). But it can be explained by two facts: against [...]Rs. It is hardly a surprise that at the bottom of the graph are situated those who should be there: 1900-rated players and blitz games of 2700-rated players.

    4. Miscellaneous

    In this section several additional interesting conclusions offered by the collected data will be provided.

    4.1 Chessmetrics 3-year peak top 50

    The table below compares the 3-year peak ratings for each player, taken from the chessmetrics.com site, and their FIDE equivalents in 2008. The year indicates the middle year of the three-year period.

    According to that table, the strongest level of play of all time was shown by Kasparov during 1989/1991, when his play supposedly would have been rated circa 2860 in 2008. Yet, it must be taken into account that this table is a bit

    rank  name          year  chessmetrics  FIDE 2008
     1    Kasparov      1990  2874          2861
     2    Fischer       1972  2867          2817
     3    Capablanca    1920  2857          2664
     4    Lasker        1895  2855          2562
     5    Botvinnik     1946  2852          2738
     6    Alekhine      1931  2841          2684
     7    Karpov        1989  2833          2819
     8    Anand         1998  2822          2825
     9    Kramnik       2001  2815          2824
    10    Pillsbury     1901  2806          2540
    11    Maroczy       1906  2799          2554
    12    Korchnoi      1979  2798          2763
    13    Tarrasch      1895  2796          2503
    14    Ivanchuk      1992  2794          2785
    15    Steinitz      1885  2794          2451
    16    Smyslov       1955  2793          2703
    17    Petrosian     1962  2789          2716
    18    Tal           1960  2786          2708
    19    Rubinstein    1912  2781          2559
    20    Reshevsky     1953  2776          2681
    21    Najdorf       1947  2775          2664
    22    Zukertort     1884  2774          2425
    23    Keres         1956  2773          2685
    24    Nimzowitsch   1929  2770          2607
    25    Bronstein     1951  2770          2669
    26    Spassky       1970  2767          2712
    27    Kamsky        1995  2765          2762
    28    Chigorin      1896  2763          2475
    29    Marshall      1917  2759          2556
    30    Leko          2001  2757          2766
    31    Janowsky      1904  2757          2504
    32    Fine          1940  2756          2626
    33    Topalov       1997  2754          2755
    34    Salov         1994  2754          2749
    35    Gelfand       1992  2754          2745
    36    Shirov        2000  2753          2760
    37    Bogoljubow    1927  2753          2583
    38    Geller        1963  2752          2681
    39    Morozevich    2000  2751          2758
    40    Euwe          1936  2750          2608
    41    Adams         2001  2749          2758
    42    Polugaevsky   1977  2748          2709
    43    Beliavsky     1988  2747          2731
    44    Timman        1989  2747          2733
    45    Schlechter    1911  2747          2522
    46    Portisch      1980  2746          2713
    47    Stein         1966  2745          2681
    48    Vaganian      1985  2744          2721
    49    Jussupow      1987  2744          2725
    50    Larsen        1970  2744          2689


    misleading. It is often the case that a player's chessmetrics rating a decade or more later is only a few points below his peak, but his play is nevertheless better due to the general rise in chess skills. For example, Lasker's chessmetrics rating in 1894 was 2878, but in 1917 it was 2860, whose 2008 equivalent would have been ca 2650. Kasparov's rating in 1999 was 2884, merely 2 points lower than the best he achieved in 1993, but there is no doubt that his actual quality of play had improved by that time.

    4.2 Comparison of human and engine ratings

    In this work there is enough data on both humans and engines to make interesting comparisons between such different types of players. Below is a side-by-side comparison of the relationships between the accuracy of play and the ratings of either type. CCRL ratings are given as of 17.10.2014. According to the site, the time control was chosen in such a way as to be equivalent to 40 moves per 40 minutes on an Athlon 64 X2 4600+ (2.4 GHz).

    As we can observe, the trend lines are of opposite nature. The relationship between the accuracy of play of human chess players and the rating is logarithmic: on lower levels, the accuracy gaps are smaller than at the top. In the case of engines, on the other hand, it is the complete opposite: exponential. It should be noticed how closely the trend line follows the actual line representing the accuracy of play; there is a stark contrast. It confirms what has long been known: the play of computers is far more stable.

    Based on the data on those two graphs, it is possible to compile conversion tables for finding one-to-one correspondences between both rating systems.

    Graph 20: FIDE rating vs accuracy
    [Chart "The accuracy of play and FIDE rating 2008": x axis ELO rating (3000 down to 1600), y axis expected error (0 to 0,28).]

    Graph 21: CCRL rating vs accuracy
    [Chart "The accuracy of play and CCRL rating 2014": x axis CCRL 40/40 rating (1500 to 3300), y axis expected error (0 to 0,28).]

    CCRL 40/40   FIDE 2008 40/90+30
    3400         2926
    3300         2921
    3200         2916
    3100         2911
    3000         2904
    2900         2897
    2800         2888
    2700         2877
    2600         2864
    2500         2849
    2400         2830
    2300         2805
    2200         2774
    2100         2734
    2000         2679
    1900         2603
    1800         2493
    1700         2325
    1600         2054

    FIDE 2008 40/90+30   CCRL 40/40
    2900                 2941
    2850                 2507
    2800                 2281
    2750                 2136
    2700                 2034
    2650                 1957
    2600                 1896
    2550                 1847
    2500                 1805
    2450                 1770
    2400                 1739
    2350                 1712
    2300                 1688
    2250                 1667
    2200                 1647
    2150                 1630
    2100                 1614
    2050                 1599
    2000                 1586
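    A conversion table of this kind can be applied to intermediate ratings by interpolation. The sketch below is only an illustration: it uses a subset of the CCRL 40/40 to FIDE 2008 rows, and linear interpolation between neighbouring rows is an assumption made here, not the method used to build the table. The names CCRL_TO_FIDE and ccrl_to_fide are hypothetical.

```python
from bisect import bisect_left

# Subset of the CCRL 40/40 -> FIDE 2008 rows from the conversion table
# (illustrative; any other rows of the table could be added the same way).
CCRL_TO_FIDE = [
    (2300, 2805), (2400, 2830), (2500, 2849), (2600, 2864),
    (2800, 2888), (3000, 2904), (3100, 2911), (3200, 2916),
    (3300, 2921), (3400, 2926),
]


def ccrl_to_fide(ccrl: float) -> float:
    """Estimate a FIDE 2008 equivalent for a CCRL 40/40 rating.

    Assumption: values between two table rows are linearly interpolated.
    Ratings outside the tabulated range are rejected instead of extrapolated.
    """
    xs = [x for x, _ in CCRL_TO_FIDE]
    if not xs[0] <= ccrl <= xs[-1]:
        raise ValueError("rating outside the tabulated range")
    i = bisect_left(xs, ccrl)          # index of the first row >= ccrl
    if xs[i] == ccrl:                  # exact table row
        return float(CCRL_TO_FIDE[i][1])
    (x0, y0), (x1, y1) = CCRL_TO_FIDE[i - 1], CCRL_TO_FIDE[i]
    return y0 + (y1 - y0) * (ccrl - x0) / (x1 - x0)


if __name__ == "__main__":
    for rating in (2450, 2700, 3000):
        print(rating, ccrl_to_fide(rating))
```

    Because the tabulated curve flattens strongly at the top, interpolation is much less sensitive there than at the lower end, where a difference of 100 CCRL points corresponds to a far larger change in the FIDE equivalent.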

    Graph 22: human and engine rating comparison
    [Chart "relationship between FIDE and CCRL ratings": x axis CCRL 2014 40/40, y axis FIDE 2008 40/90+30 (2000 to 3000).]

    Graph 23: engine and human rating comparison
    [Chart "relationship between CCRL and FIDE ratings": x axis FIDE 2008 40/90+30, y axis CCRL 2014 40/40 (1300 to 3100).]


    At first sight, it may seem surprising that the best chess engines are, according to this, so weak compared to humans. But it must be taken into consideration that CCRL games are run on quite weak hardware and the rate of play is nearly 3x quicker than the standard FIDE time control. It can be concluded from the graph that in the beginning it was quite easy for engines to make progress against humans, but with time it is getting increasingly harder. Note: the comparisons were made on the assumption that humans play against engines as they would against other humans, i.e. without using any anti-computer strategies. Unfortunately there is not yet a reliable way to emulate anti-computer play and its effects.

    The reasons why the relationships between the accuracy and strength of play are exactly like that are unknown to the author. One feasible reason could be that the nature of the curve is related to the relative importance of calculation versus evaluation. The larger the relative importance of calculation in the move-choosing process (engines), the steeper the exponential curve is: while gaining rating points, there is an ever-decreasing rate of accuracy gain. But if a player has a larger relative importance of evaluation (humans), then there is a contrary phenomenon: rating growth means a faster increase in accuracy. If this is true, then there presumably must be a hypothetical balance between calculation and evaluation at which the accuracy vs rating relationship is linear.

    4.3 Rating inflation and deflation


    On the first graph we can see that the gap has decreased from 60 points in 1970 to 37 points in 2014, or 5 points per decade. Clearly recognizable are two 'mountains' and four 'valleys'. The valleys refer to periods where rating numbers were quite high because of a dominant player: Fischer, Kasparov (two periods), and Carlsen. The mountains mark periods with equality and no clear dominator. On the other graph there is a completely different situation. The ratings of the 1st players of the rating lists have been relatively stable since the 1890s, while the skill level has steadily been rising. In other words, what we see there is deflation in the chessmetrics rating system. The rate of deflation has decreased somewhat since the 1960s, which is logical, as the rate of improvement of playing skills must slacken over time. Here too, 'valleys' from domination periods can be seen. The rate of deflation is 608 points within 1843/2004, or 3.78 points per year.
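    The two rates quoted above follow from simple division. The short check below only reproduces that arithmetic from the figures given in the paragraph; the helper name rate_per_year is hypothetical.

```python
def rate_per_year(total_points: float, start_year: int, end_year: int) -> float:
    """Average change in rating points per year over the given period."""
    return total_points / (end_year - start_year)


# Chessmetrics deflation: 608 points between 1843 and 2004.
deflation_per_year = rate_per_year(608, 1843, 2004)

# FIDE gap: from 60 points in 1970 to 37 points in 2014, expressed per decade.
inflation_per_decade = 10 * rate_per_year(60 - 37, 1970, 2014)

if __name__ == "__main__":
    print(round(deflation_per_year, 2))    # 3.78
    print(round(inflation_per_decade, 1))  # 5.2
```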

    4.4 Various trends

    Previously we looked at how the strength of play has changed across history and across the two distinct rating systems. However, the same can be applied to changes in other factors, such as slope and both factors of difficulty. Before having a closer look at those, a short introduction to the notion of 'slope' and what it actually indicates is presented below.

    As persons more familiar with chess know, chess players can be split into two groups based on the nature of play:

    1 positional, where intuition and knowledge prevail

    2 tactical, where the speed of calculation, precision and creativity are most important

    It is usually known that the nature of play dictates the choice of openings and the type of positions, but differences are also present in players' tolerances with respect to the difficulty of positions and thinking time. If one tries to solve a problem by calculating variations and possible outcomes, then it generally takes a lot of time before a solution is reached. On the other hand, calculation is universal: it is suitable for solving problems of any type and of any difficulty. The advantage of solving problems by knowledge or intuition is speed. It takes almost no time to recall facts from memory or to realize something via intuition. The disadvantage is that this approach is only suitable for relatively simple and more familiar problems; when solutions are illogical and unexpected, it fails. From this, the following facts follow:

    1 players of the positional type are relatively less sensitive to thinking time, but more sensitive to the difficulty of positions

    2 it is the contrary with tactical players: they are less sensitive to the difficulty of positions, but more sensitive to thinking time

    Hence the fact that the size of the slope of the relationship between average error and a factor of difficulty depends on player type. Tactical players have a smaller slope, positional ones a bigger one. Such a phenomenon may give us a simple method to find out which players rely relatively more on calculation and which ones on intuition/knowledge in their move-finding processes.

    The graph on the left shows the absolute average slope of all entities covered in this study. But, as can be seen on the graphs 6 and 7, the size of the slopes depends on the average error. Therefore, it is preferable to determine how

    Graph 25: deflation in the chessmetrics rating system
    [Chart "Chessmetrics rating deflation": x axis year, y axis gap (-800 to 100); linear trend line with a slope of about 3,8 points per year, R² = 0,92; annotation: 38 points per decade.]


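    much the actual slope deviates from the expected slope. The formulas for calculating the relative slope, i.e. the absolute slope divided by the expected slope, are given here:

    $$\text{complexity relative slope} = \frac{\text{absolute slope}}{0.61 \cdot (\text{avg error})^{0.93}}$$

    and

    $$\text{difference relative slope} = \frac{\text{absolute slope}}{0.24 \cdot \ln(\text{avg error}) + 0.79}$$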

    The smaller the number, the bigger the importance of calculation. The graph on the right reveals that all chess engines are in the top half, as expected. Capablanca and Kramnik are situated in the bottom half, confirming the common belief that those players were primarily intuitive players. Perhaps surprisingly, Kasparov and Karpov stand close to each other, and Lasker is far down.
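    The normalisation described above, dividing an entity's absolute slope by the slope expected for its average error, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the default coefficients (0.61, 0.93, 0.24, 0.79) are taken as read from the formulas and should be treated as assumptions rather than verified constants.

```python
import math


def expected_complexity_slope(avg_error: float, a: float = 0.61, p: float = 0.93) -> float:
    """Expected complexity slope as a power function of the average error.

    Coefficients a and p are assumptions read from the formula in the text.
    """
    return a * avg_error ** p


def expected_difference_slope(avg_error: float, a: float = 0.24, b: float = 0.79) -> float:
    """Expected difference slope as a logarithmic function of the average error.

    Coefficients a and b are assumptions read from the formula in the text.
    """
    return a * math.log(avg_error) + b


def relative_slope(absolute_slope: float, expected_slope: float) -> float:
    """Ratio of the observed slope to the expected one.

    A value of 1.0 means the entity behaves exactly as expected for its
    average error; smaller values point to a more calculation-based style.
    """
    if expected_slope <= 0:
        raise ValueError("expected slope must be positive")
    return absolute_slope / expected_slope


if __name__ == "__main__":
    avg_error = 0.16  # hypothetical entity
    exp_c = expected_complexity_slope(avg_error)
    exp_d = expected_difference_slope(avg_error)
    print(relative_slope(0.10, exp_c), relative_slope(0.30, exp_d))
```

    The guard on a non-positive expected slope matters for the logarithmic form, which becomes zero or negative for very small average errors, so the ratio is only meaningful above that point.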

    The changes of the average relative slopes across time and both rating systems are presented below.

    Graph 26: players sorted by absolute slopes
    [Bar chart "absolute slopes" for all 38 entities (engines, rating groups, decades and individual players); series: complexity, difference, average; scale from -0,1 to 0,7, bar values ranging from 0,45 down to -0,01.]

    Graph 27: players sorted by relative slopes
    [Bar chart "relative slopes" for all 38 entities (engines, rating groups, decades and individual players); series: complexity, difference, average; scale from -1,00 to 3,00, bar values ranging from 2,02 down to -0,21.]

    Graph 28: relative slope across time periods
    [Chart "relative slopes vs decades": x axis decades (1860s to 2000s), y axis relative slope (0 to 2,5); series: average, complexity and difference, each with a logarithmic trend line.]


    It appears that the average relative slope slowly decreases with time; the rate of decrease is even bigger in the CCRL rating system. It exhibits a completely different behaviour in the FIDE rating system, where stronger players have bigger slopes. In other words, today's chess players have become more calculative than they were in the past. In a sense, it is logical: after My System by Nimzowitsch there has been no significant breakthrough in chess middlegame theory.

    Positions have become somewhat easier compared to earlier times. Especially eye-catching is the low point between 1920/1940. Regarding the FIDE rating, it looks like stronger players have a tendency to make positions more complicated. Better chess engines, on the other hand, end up playing in relatively easier positions.

    4.5 Influence of errors

    As a final part, here is data on the influence of errors of various magnitudes. The influence of an error is calculated by multiplying its frequency by its magnitude. By comparing the resulting number with those of other errors, it can be ascertained which magnitudes of errors are the biggest source of inaccurate play. On the graphs below each blue datapoint marks the product of the magnitude of an error and its frequency. The red line shows the moving average of 9 datapoints. The upper graph is based on the data used in this study (1777 positions, average error 0.160), the lower graph represents data taken from an earlier study5 (14 174 positions, average error 0.132).

    5 http://www.chessanalysis.ee/summary450.pdf
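    The calculation described above, frequency times magnitude followed by a moving average of 9 datapoints, can be sketched as below. This is a minimal illustration under stated assumptions: errors are grouped into bins of 0.01 (the bin width is not given in the text), frequency is taken as a plain count, and the function names error_influence and moving_average are hypothetical.

```python
from collections import Counter


def error_influence(errors, bin_width=0.01):
    """Return sorted (magnitude, influence) pairs.

    influence = number of errors in the bin * magnitude of the bin,
    i.e. the summed error contributed by that magnitude.
    Moves with zero error are ignored. The bin width is an assumption.
    """
    counts = Counter(round(e / bin_width) for e in errors if e > 0)
    return [(k * bin_width, n * k * bin_width) for k, n in sorted(counts.items())]


def moving_average(values, window=9):
    """Simple moving average; only full windows are returned."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Hypothetical per-move errors, for illustration only.
    sample = [0.0, 0.01, 0.01, 0.03, 0.03, 0.03, 0.10, 0.25]
    pairs = error_influence(sample)
    print(pairs)
    print(moving_average([influence for _, influence in pairs], window=2))
```

    With real data, the magnitude at which the smoothed curve peaks indicates which error sizes contribute most to the total inaccuracy.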

    Graph 33: complexity across FIDE rating
    [Chart "complexity vs FIDE": x axis FIDE 2008 (1900 to 2700), y axis complexity (0,4 to 0,6).]

    Graph 34: difference across FIDE rating
    [Chart "difference vs FIDE": x axis FIDE 2008 (1900 to 2700), y axis difference (0,2 to 0,4).]

    Graph 35: complexity across CCRL rating
    [Chart "complexity vs CCRL": x axis CCRL 2014 (1800 to 3000), y axis complexity (0,4 to 0,6).]

    Graph 36: difference across CCRL rating
    [Chart "difference vs CCRL": x axis CCRL 2014 (1800 to 3000), y axis difference (0,16 to 0,26).]

    Graph 37: influence of errors 1
    [Chart "influence of errors 1 (avg error 0,160)": x axis avg error, y axis error sum (0 to 24); series: sum of errors, moving average of 9.]


    Despite the fact that the average error is significantly different in the two datasets, the side-by-side comparison reveals, as shown by the red moving average line, that the biggest influence is roughly equal: both are around 0.20. Hence the question: will the main source of inaccurate play also remain the same if we have a separate look at engines and humans with roughly the same accuracy? On the following graphs, the overall average error of all engine moves is 0.084. Human moves were grouped in two, one based on players with average error higher than 0.100 and those whose average error was lower than 0.200. The average errors of all moves combined were 0.071 and 0.243 respectively.

    Graph 38: influence of errors 2
    [Chart "influence of errors 2 (avg error 0,132)": x axis avg error, y axis error sum (0 to 26); series: sum of errors, moving average of 9.]

    Graph 39: Influence of engine errors
    [Chart "influence of engine errors (avg error 0,084)": x axis avg error, y axis error sum (0 to 4,5); series: sum of errors, moving average of 9.]

    Graph 40: Influence of human errors 1
    [Chart "influence of human errors 1 (avg error 0,071)": x axis avg error, y axis error sum (0 to 4,5); series: sum, moving average of 9.]


    The results might be described as remarkable. All three graphs show that, irrespective of the accuracy of the moves, the error influence peak is persistently situated around the 0.20 mark. True, in the case of engines there is a small oddity: the peak seems to have a little cavity centered around 0.20, with two ridges surrounding it. It is presently unknown what may be the cause of that. These graphs also explain why it is not advisable to use threshold-based analysis, at least not with threshold values of 0.10 and above. At first glance, small to medium-sized errors may seem insignificant, but what they lack in gravity, they make up for by being more numerous.

    5. Conclusion and future perspectives

    In this study the author, using Rybka 3, tried to measure the objective strength of play and to determine its relationship with either type of rating system. Besides that, the aim was to record the change in the strength of play through time. It could be compared to athletics world record progression or world leading mark tables, which provide a good overview of development in athletics. Unlike many sports today, chess has been played for centuries, and the level of play has long been very high. As became clear earlier, already by the beginning of the previous century top players were on a par with today's weaker GMs. In the light of this, John Nunn's speculation that Hugo Süchting at the Karlsbad tournament in 1911 was merely a 2100-rated player should be regarded as a serious underestimation. According to the chessmetrics rating after the tournament, he was only 253 points short of Lasker, which would rate the latter ca 250 points lower than can be seen on the graph 17. Due to the fact that so far there has been no reliable method for measuring the quality of play, the overhyping of players of the past has gained ground. Surprisingly many people seriously think that former great figures were at least as talented as today's top players, and that they played as well or even better. This is psychologically completely understandable but practically unnecessary: we all tend to see the past as more beautiful than it actually was. Players of the past are also overrated because, out of all their games, there is a tendency to selectively highlight the better specimens, whereas in the case of contemporary players, various sites providing live engine-assisted analysis display their average level. And since the population of the world and the number of chess players back then were smaller, the same can be said about the talent pool. It is more probable to find more naturally talented players in a larger pool.

    The comparison between the CCRL and FIDE rating systems gave a surprising conclusion. Before that, the author held the opinion that both systems had an analogous relationship between strength and accuracy. It turns out that the relationships are of opposite nature. The accuracy of humans decreases logarithmically with strength of play, at a gradually diminishing rate, but the accuracy of play of engines, on the other hand, decreases at an exponential rate. The fact that engines from the bottom part of the rating list are weak was noted long ago. One conclusion that can be made is that it is virtually impossible to reach negative ratings in engine rating systems, whereas it is very easy in the FIDE rating system. The weak point in the conclusion is that, because of the lack of proper methods, it was not possible to reckon with the impact of anti-computer strategy on the results. In the future it would be necessary to devise methods to describe and research it more closely, including how it depends on the strength of engines and the depth of search.

    A problem is the relative instability of human play, which is clearly illustrated on the graph 20. It makes the conclusions somewhat untrustworthy. Therefore an increase in the number of analyzed moves per player is recommended.

    The most difficult part in such analysis work is obviously practical play. There were no satisfactory outcomes regarding that. It was found that a phenomenon that could be called 'objectivity-practicality bias' is still present in the results. Therefore the results of players whose difficulty of positions was far from the average must be regarded with caution. The more difficult the positions, the bigger the probability that a player's result according to the analysis turns out to be underrated; in the case of positions far below average difficulty, the results will generally be overrated. Previously we saw that practical play is

    Graph 41: influence of human errors 2
    [Chart "influence of human errors 2 (avg error 0,243)": x axis avg error, y axis error sum (0 to 10); series: sum, moving average of 9.]


    based on differentials in the three categories: difficulty of positions, type of positions and thinking time. But there is also another interesting, at least theoretical, way. It is based on the fact that moves can be characterized not only by quality, but also by how easy or hard they are to notice. Some moves are fairly obvious, others can seem quite illogical and at first glance wrong. The fact that there are a lot of positions where the best (often the only) move is extremely hard to see is one of the chief reasons why chess is such a difficult and fascinating game. Here we have arrived at the problem of measuring the obviousness of moves. On what basis and how to measure it? [...]erez in Thessaloniki 2013 and Kramnik in London Classic 2011 were also remarkable performances. There exists a small possibility that their actual quality of play surpasses that of Carlsen. These three performances definitely deserve further scrutiny. The closer a score is to 100% and the fewer games played, the more unreliable the TPR value will be. Thus it would be interesting to look into discrepancies between TPR and quality of play as a function of score and the number of games. It is worth paying attention to the tournaments of Lasker and Capablanca in New York in 1924 and 1927. Despite the fact that they both had almost equal chessmetrics TPRs, as shown on the graph 19, the difference in the quality of play caused by the objectivity-practicality bias is 205 points. Taking into account the large number of games and the relatively small timespan between them, it is quite probable that the quality of play of both players closely corresponds to the TPRs. For this reason, comparing the games of Lasker and Capablanca from the New York tournaments would presumably be a good indicator of the trustworthiness of the methods used for analysis.

    The results also showed that the FIDE rating has been inflating since 1970 with respect to absolute strength, at an average rate of about 5 points per decade. It is still relatively modest, which may explain why K.