OpenChess

Posted: **Sun Dec 16, 2012 4:36 pm**

One issue that has come up is the stylistic bias of various programs. Do some programs try harder to avoid draws? This is clearly a factor in human games.

I am trying to do an experiment to measure this, but I have found that it is difficult to control the default contempt heuristics in some programs. The test is very sensitive to this, because contempt apparently makes a bigger difference than I imagined. But I am far more interested in finding out whether certain playing styles can cause a program to avoid draws (and thus increase both the loss rate and the win rate for a given score).

I did do a test (published on talkchess) where I adjusted 3 programs to play at the same strength in a long round-robin match in order to measure the relative difference in draw rates given the same score. The programs were Komodo (dev), Houdini 3 and Stockfish. Houdini 3 appeared to be the most "draw averse", losing more games while keeping the same score as the other 2 programs. However, it turns out that Houdini has a very aggressive built-in contempt factor combined with some king safety voodoo to also avoid draws. In a second test I set Komodo to a higher contempt factor and Houdini's to zero (which is still aggressive, I am told), and this made a huge difference: Houdini came out way below Stockfish, as a very conservative and careful program. Komodo, with a contempt of 23, was on top. Some programs allow you to set this, but not all.

I'm doing another run with Spike, Spark, Komodo, Critter and Stockfish. I'm using the default parameters for each program, making no attempt to adjust contempt factors or playing style.

Posted: **Mon Dec 17, 2012 2:23 pm**

Here are the players in my study and the result of a match of a few thousand games. Note that I made a preliminary attempt to "time adjust" the ratings so that there would not be serious mismatches:

Rank    ELO     +/-    Games    Score  Player
---- ------- ------ -------- --------  ----------------------------
   1  3027.2   10.6     2938   52.280  spike14      
   2  3019.5   10.6     2940   50.901  kdev-4518.00 
   3  3015.8   10.6     2938   50.221  c16          
   4  3010.2   10.6     2940   49.218  sf23         
   5  3000.0   10.6     2938   47.379  spark1-0     

w/l/d: 2582 2332 2433    33.12 percent draws

In a previous 3-player run I was meticulous about adjusting the ratings, bringing the programs within 5 Elo of each other. But a formula was suggested by Jesús Muñoz which, in his own words, looks like this:

µ_i: score of the i-th engine. 
D_i: draw ratio of the i-th engine. 

c_i = (0.5 + |µ_i - 0.5|)*D_i 
(c')_i = (0.5 - |µ_i - 0.5|)*D_i

When one program is significantly stronger than another, the draw rate naturally goes way down, so one cannot simply compare the observed draw rates. In tests I did, this formula appears to compensate for that: I could not make one program appear more drawish than another by manipulating the handicaps.

I "normalized" the values output by this formula by dividing the sum over all players by each player's value (each result as the denominator, the sum as the numerator) to get this table. A higher Risk Style means the program is more inclined to avoid draws:

 Percent   Percent   Percent   Percent      Risk 
Decisive      Wins    Losses     Draws     Style  Player
--------  --------  --------  --------  --------  -------------------
   68.90     31.82     37.08     31.10   5.19589  spark1-0
   66.52     33.48     33.04     33.48   5.05858  c16
   66.47     32.44     34.04     33.53   4.99435  sf23
   66.20     34.04     32.17     33.80   4.94098  kdev-4518.00
   66.28     35.42     30.86     33.72   4.82530  spike14
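
For anyone who wants to check the Risk Style column, here is a minimal Python sketch. It assumes (this is inferred from the published numbers, not stated in the post) that the normalized value is the sum of c_i over all players divided by each player's c_i, and that the score is wins plus half the draws. The function names `draw_coefficients` and `risk_styles` are made up for illustration.

```python
def draw_coefficients(score, draw_ratio):
    """Return (c, c_prime) from the formula above; inputs are fractions in [0, 1]."""
    offset = abs(score - 0.5)
    return (0.5 + offset) * draw_ratio, (0.5 - offset) * draw_ratio


# (player, percent wins, percent draws), copied from the table above.
results = [
    ("spark1-0", 31.82, 31.10),
    ("c16", 33.48, 33.48),
    ("sf23", 32.44, 33.53),
    ("kdev-4518.00", 34.04, 33.80),
    ("spike14", 35.42, 33.72),
]


def risk_styles(rows):
    """Assumed normalization: sum of all c_i divided by each player's c_i."""
    c_values = {}
    for name, win_pct, draw_pct in rows:
        score = (win_pct + draw_pct / 2) / 100  # wins plus half the draws
        c, _ = draw_coefficients(score, draw_pct / 100)
        c_values[name] = c
    total = sum(c_values.values())
    return {name: total / c for name, c in c_values.items()}


styles = risk_styles(results)
for name, value in styles.items():
    print(f"{name:14s} {value:.3f}")
```

Because the inputs are the rounded percentages from the table, the output should agree with the Risk Style column only to about two decimal places. With this normalization, a value above the number of players (5 here) means fewer draws than the field for a comparable score.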

I am cautious about assigning any deeper meaning to this, partly due to the contempt factor issue. It is simply an attempt to measure the "draw aversion" of a program in relation to other programs. Komodo has a default contempt of 7, Stockfish 0, and I have not checked the others. Some programs do not allow you to change it. With a little experimentation you can usually figure out what the contempt factor of a program is simply by setting up positions where it can force a draw and should.

I would love to run this test with many programs with contempt factors of zero.

Posted: **Tue Dec 18, 2012 11:23 pm**

Here is a cross post from talkchess showing some data that Adam Hair generated:

From the IPON data:

  Name of the engine      µ    D   D_max    k     k*µ*(1 - µ) 

Houdini 3 STD            82%  24%   36%  0.6667     0.0984 
Komodo 5                 73%  34%   54%  0.6296     0.1241 
Critter 1.4a             71%  37%   58%  0.6379     0.1314 
Stockfish 2.2.2 JA       69%  40%   62%  0.6452     0.138 
Deep Rybka 4.1           68%  40%   64%  0.625      0.136 
Chiron 1.5               52%  42%   96%  0.4375     0.1092 
Deep Fritz 13 32b        51%  40%   98%  0.4082     0.102 
Naum 4.2                 50%  42%  100%  0.42       0.105 
HIARCS 14 WCSC 32b       48%  40%   96%  0.4167     0.104 
Hannibal 1.2             45%  40%   90%  0.4444     0.11 
Gull 1.2                 45%  39%   90%  0.4333     0.1073 
Deep Shredder 12         45%  40%   90%  0.4444     0.11 
Deep Sjeng c't 2010 32b  43%  41%   86%  0.4767     0.1169 
Spike 1.4 32b            42%  40%   84%  0.4762     0.116 
spark-1.0                41%  39%   82%  0.4756     0.1151 
Protector 1.4.0          39%  39%   78%  0.5        0.119 
Deep Junior 13.3         39%  34%   78%  0.4359     0.1037 
Quazar 0.4               36%  37%   72%  0.5139     0.1184 
Zappa Mexico II          32%  35%   64%  0.5469     0.119 
MinkoChess 1.3           31%  36%   62%  0.5806     0.1242
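
The post does not define D_max and k, but the numbers in the table are consistent with D_max = 2*min(µ, 1 - µ) (the largest draw ratio possible for a given score, reached when the engine never wins or never loses) and k = D / D_max. A small Python sketch under that assumption, with a hypothetical function name:

```python
def draw_columns(score, draw_ratio):
    """Return (d_max, k, k*score*(1-score)) for fractions in (0, 1).

    Assumed definitions, inferred from the table rather than stated in it:
      d_max = 2 * min(score, 1 - score)
      k     = draw_ratio / d_max
    """
    d_max = 2 * min(score, 1 - score)
    k = draw_ratio / d_max
    return d_max, k, k * score * (1 - score)


# Houdini 3 STD row: score 82%, draw ratio 24%.
d_max, k, scaled = draw_columns(0.82, 0.24)
print(f"D_max={d_max:.2f}  k={k:.4f}  k*mu*(1-mu)={scaled:.4f}")
```

For the Houdini 3 row this gives 36%, 0.6667 and 0.0984, matching the table; the other rows can be checked the same way.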

Just so I would not be left out of the fun, I have done some work on this also. I compared the draw rate of each match to an estimate of the draw rate as a function of Elo difference and found the average difference for each engine (throwing out the Zappa vs Fritz result as an outlier). I then adjusted the average differences to account for the positive correlation of draw rates with Elo ratings. The resulting percentages represent the deviation of the IPON draw rates for each engine from the expected draw rates, given the Elo difference of each match and the strength of each engine. Here is Jesús' table with my draw deviation column added:

 Name of the engine      µ    D   D_max    k     k*µ*(1 - µ)    Draw deviation 

Houdini 3 STD            82%  24%   36%  0.6667     0.0984      -2.05% 
Komodo 5                 73%  34%   54%  0.6296     0.1241      -0.19% 
Critter 1.4a             71%  37%   58%  0.6379     0.1314       0.10% 
Stockfish 2.2.2 JA       69%  40%   62%  0.6452     0.138        1.68% 
Deep Rybka 4.1           68%  40%   64%  0.625      0.136        2.41% 
Chiron 1.5               52%  42%   96%  0.4375     0.1092       0.87% 
Deep Fritz 13 32b        51%  40%   98%  0.4082     0.102       -0.07% 
Naum 4.2                 50%  42%  100%  0.42       0.105        0.63% 
HIARCS 14 WCSC 32b       48%  40%   96%  0.4167     0.104       -1.06% 
Hannibal 1.2             45%  40%   90%  0.4444     0.11        -0.04% 
Gull 1.2                 45%  39%   90%  0.4333     0.1073      -1.93% 
Deep Shredder 12         45%  40%   90%  0.4444     0.11        -0.08% 
Deep Sjeng c't 2010 32b  43%  41%   86%  0.4767     0.1169       0.45% 
Spike 1.4 32b            42%  40%   84%  0.4762     0.116        0.24% 
spark-1.0                41%  39%   82%  0.4756     0.1151      -0.03% 
Protector 1.4.0          39%  39%   78%  0.5        0.119       -0.16% 
Deep Junior 13.3         39%  34%   78%  0.4359     0.1037      -5.21% 
Quazar 0.4               36%  37%   72%  0.5139     0.1184      -1.07% 
Zappa Mexico II          32%  35%   64%  0.5469     0.119        0.00% 
MinkoChess 1.3           31%  36%   62%  0.5806     0.1242       0.69%

From this, it appears that Junior 13.3 has the most draw aversion, followed by Houdini 3. Rybka 4.1 and Stockfish 2.2.2 have the least draw aversion.
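
The first step Adam describes (comparing each match's draw rate with an expected rate and averaging the difference per engine) might be sketched roughly as below. This is only an illustration: the expected-draw-rate curve he used is not given, so `expected_draw_rate` is a caller-supplied placeholder, the toy matches and the linear curve are invented, and the later adjustment for the correlation between draw rate and Elo is left out entirely.

```python
from collections import defaultdict


def average_draw_deviation(matches, expected_draw_rate):
    """Average (observed - expected) draw rate per engine.

    matches: iterable of (engine_a, engine_b, elo_diff, observed_draw_rate).
    expected_draw_rate: hypothetical callable mapping |Elo difference| to a
    draw rate; the real curve from the post is unknown.
    """
    residuals = defaultdict(list)
    for engine_a, engine_b, elo_diff, observed in matches:
        residual = observed - expected_draw_rate(abs(elo_diff))
        # Each match counts toward both participants.
        residuals[engine_a].append(residual)
        residuals[engine_b].append(residual)
    return {name: sum(vals) / len(vals) for name, vals in residuals.items()}


# Invented placeholder curve and toy data, for illustration only.
def toy_curve(elo_diff):
    return max(0.0, 0.40 - 0.0005 * elo_diff)


toy_matches = [
    ("A", "B", 0, 0.42),
    ("A", "C", 100, 0.30),
    ("B", "C", 100, 0.35),
]

deviations = average_draw_deviation(toy_matches, toy_curve)
for name, value in sorted(deviations.items()):
    print(f"{name}: {value:+.3f}")
```

With this sign convention a negative deviation means fewer draws than expected, which is how the table above is read (Junior's -5.21% being the most draw-averse).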

Posted: **Mon Jan 28, 2013 9:04 pm**

Don wrote: With a little experimentation you can usually figure out what the contempt factor of a program is simply by setting up positions where it can force a draw and should.

It would be useful if someone could collate such info, and/or give positions for such testing (and any results). My recollection is that the contempt in Rybka also has an effect on various evaluation scorings, such as asymmetric pawn structures.

Draw aversion
