One issue that has come up is the stylistic bias of various programs. Do some programs try harder to avoid draws? This is clearly a factor in human games.
I am trying to run an experiment to measure this, but I have found that it is difficult to control the default contempt heuristics in some programs. The test is very sensitive to this, because contempt apparently makes a bigger difference than I imagined. But I am far more interested in finding out whether certain playing styles can cause a program to avoid draws (and thus increase both its loss rate and its win rate for a given score).
I did do a test (published on talkchess) where I adjusted 3 programs to play at the same strength in a long round robin match, in order to measure the relative difference in draw rates given the same score. The programs were Komodo (dev), Houdini 3 and Stockfish. Houdini 3 appeared to be the most "draw averse", losing more games while keeping the same score as the other 2 programs. However, it turns out that Houdini has a very aggressive built-in contempt factor combined with some king safety voodoo to also avoid draws. In a second test I set Komodo to have a higher contempt factor and Houdini's to zero (which is still aggressive, I am told) and this made a huge difference. In fact Houdini came out way below Stockfish, as a very conservative and careful program. Komodo with the 23 contempt was on top. Some programs allow you to set this, but not all.
I'm doing another run with Spike, Spark, Komodo, Critter and Stockfish. I'm using the default parameters for each program, making no attempt to adjust contempt factors or playing style.
Draw aversion
Re: Draw aversion
Here are the players in my study and the result of a match of a few thousand games. Note that I made a preliminary attempt to "time adjust" the ratings so that there would not be serious mismatches:
In a previous 3-player run I was meticulous about adjusting the ratings, getting them within 5 Elo of each other. But a formula was suggested by Jesús Muñoz, which in his own words looks like this:
When one program is significantly stronger than another, the draw rate naturally goes way down, so you cannot simply observe the draw rate. In the tests I did, this formula appears to compensate for that: I could not make one program appear more drawish than another by manipulating the handicaps.
I "normalized" the values output by this formula by dividing the sum of all the values by each program's own value (each result is the denominator and the sum is the numerator) to get this table - Risk Style is higher the more the program wants to avoid draws:
I am cautious about assigning any deeper meaning to this, partly because of the contempt factor issue. It is simply an attempt to measure the "draw aversion" of a program in relation to other programs. Komodo has a default contempt of 7, Stockfish 0, and I have not checked the others. Some programs do not allow you to change it. With a little experimentation you can usually figure out what the contempt factor of a program is simply by setting up positions where it can force a draw and should.
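To make the "figure out the contempt by experiment" idea concrete, here is a toy sketch. It assumes a deliberately simplified model that is not taken from any particular engine: a draw is scored as minus the contempt (in centipawns) from the engine's point of view, and the engine takes a forced draw only when no other line scores higher. The names `takes_draw` and `estimate_contempt` are hypothetical. In practice the callback would be replaced by actually showing the engine test positions and watching whether it forces the draw.

```python
def takes_draw(best_alternative_cp, contempt_cp):
    """Toy model (an assumption, not a real engine): a draw is worth
    -contempt_cp, so the draw is taken only if the best non-drawing
    line is not evaluated any higher than that."""
    return -contempt_cp >= best_alternative_cp


def estimate_contempt(engine_takes_draw, lo=-200, hi=200):
    """Bisect on the evaluation of the best non-drawing line to find the
    point where the engine stops forcing the draw.

    engine_takes_draw(eval_cp) -> bool stands in for a real test position
    whose best alternative to the draw is worth eval_cp centipawns.
    """
    if not engine_takes_draw(lo) or engine_takes_draw(hi):
        raise ValueError("switch point lies outside the probed range")
    # Invariant: the draw is taken at lo and declined at hi.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if engine_takes_draw(mid):
            lo = mid
        else:
            hi = mid
    # lo is the highest evaluation at which the draw is still taken,
    # which in this model equals -contempt.
    return -lo


if __name__ == "__main__":
    for true_contempt in (0, 7, 23):
        found = estimate_contempt(lambda e: takes_draw(e, true_contempt))
        print(true_contempt, found)
```

Real engines are messier than this (contempt may also leak into the evaluation itself), so the result is at best a rough estimate.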
I would love to run this test with many programs with contempt factors of zero.
Code: Select all
Rank  ELO     +/-   Games  Score   Player
----  ------  ----  -----  ------  ------------
1     3027.2  10.6  2938   52.280  spike14
2     3019.5  10.6  2940   50.901  kdev-4518.00
3     3015.8  10.6  2938   50.221  c16
4     3010.2  10.6  2940   49.218  sf23
5     3000.0  10.6  2938   47.379  spark1-0
w/l/d: 2582 2332 2433 33.12 percent draws
Code: Select all
µ_i: score of the i-th engine.
D_i: draw ratio of the i-th engine.
c_i = (0.5 + |µ_i - 0.5|)*D_i
(c')_i = (0.5 - |µ_i - 0.5|)*D_i
Code: Select all
Percent   Percent   Percent   Percent   Risk
Decisive  Wins      Losses    Draws     Style     Player
--------  --------  --------  --------  --------  ------------
68.90     31.82     37.08     31.10     5.19589   spark1-0
66.52     33.48     33.04     33.48     5.05858   c16
66.47     32.44     34.04     33.53     4.99435   sf23
66.20     34.04     32.17     33.80     4.94098   kdev-4518.00
66.28     35.42     30.86     33.72     4.82530   spike14
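For anyone who wants to check the Risk Style column, here is a minimal sketch. It assumes the normalization means "sum of all c_i divided by each program's own c_i", and it uses the Score and Percent Draws columns above as µ_i and D_i. The function names are my own. Under that reading the output matches the published column closely, with small differences that presumably come from rounding in the percentages.

```python
# (player, score fraction mu, draw fraction D), copied from the two tables above
RESULTS = [
    ("spark1-0",     0.47379, 0.3110),
    ("c16",          0.50221, 0.3348),
    ("sf23",         0.49218, 0.3353),
    ("kdev-4518.00", 0.50901, 0.3380),
    ("spike14",      0.52280, 0.3372),
]


def c_value(mu, d):
    """c_i = (0.5 + |mu_i - 0.5|) * D_i, as in the formula quoted above."""
    return (0.5 + abs(mu - 0.5)) * d


def risk_style(rows):
    """Assumed normalization: sum of all c values over each program's c."""
    c = {name: c_value(mu, d) for name, mu, d in rows}
    total = sum(c.values())
    return {name: total / value for name, value in c.items()}


styles = risk_style(RESULTS)

if __name__ == "__main__":
    # Highest value first, i.e. the program with the fewest adjusted draws.
    for name, value in sorted(styles.items(), key=lambda kv: -kv[1]):
        print(f"{value:8.4f}  {name}")
```

With five programs the values cluster around 5, and a program with a lower adjusted draw rate gets a larger number.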
Re: Draw aversion
Here is a cross post from talkchess showing some data that Adam Hair generated:
From the IPON data:
Just so I would not be left out of the fun, I have done some work on this also. I compared the draw rates for each match to an estimate of draw rate as a function of Elo difference and found the average difference for each engine (throwing out the Zappa vs Fritz results because they were an outlier). I then adjusted the average differences for the positive correlation of draw rates with Elo ratings. The resulting percentages represent the deviation of the IPON draw rates for each engine from the expected draw rates, given the Elo difference of each match and the strength of each engine. Here is Jesús' table with my draw deviation column added:
From this, it appears that Junior 13.3 has the most draw aversion, followed by Houdini 3. Rybka 4.1 and Stockfish 2.2.2 have the least draw aversion.
Code: Select all
Name of the engine       µ    D    D_max  k       k*µ*(1 - µ)
Houdini 3 STD            82%  24%  36%    0.6667  0.0984
Komodo 5                 73%  34%  54%    0.6296  0.1241
Critter 1.4a             71%  37%  58%    0.6379  0.1314
Stockfish 2.2.2 JA       69%  40%  62%    0.6452  0.138
Deep Rybka 4.1           68%  40%  64%    0.625   0.136
Chiron 1.5               52%  42%  96%    0.4375  0.1092
Deep Fritz 13 32b        51%  40%  98%    0.4082  0.102
Naum 4.2                 50%  42%  100%   0.42    0.105
HIARCS 14 WCSC 32b       48%  40%  96%    0.4167  0.104
Hannibal 1.2             45%  40%  90%    0.4444  0.11
Gull 1.2                 45%  39%  90%    0.4333  0.1073
Deep Shredder 12         45%  40%  90%    0.4444  0.11
Deep Sjeng c't 2010 32b  43%  41%  86%    0.4767  0.1169
Spike 1.4 32b            42%  40%  84%    0.4762  0.116
spark-1.0                41%  39%  82%    0.4756  0.1151
Protector 1.4.0          39%  39%  78%    0.5     0.119
Deep Junior 13.3         39%  34%  78%    0.4359  0.1037
Quazar 0.4               36%  37%  72%    0.5139  0.1184
Zappa Mexico II          32%  35%  64%    0.5469  0.119
MinkoChess 1.3           31%  36%  62%    0.5806  0.1242
Code: Select all
Name of the engine       µ    D    D_max  k       k*µ*(1 - µ)  Draw deviation
Houdini 3 STD            82%  24%  36%    0.6667  0.0984       -2.05%
Komodo 5                 73%  34%  54%    0.6296  0.1241       -0.19%
Critter 1.4a             71%  37%  58%    0.6379  0.1314       0.10%
Stockfish 2.2.2 JA       69%  40%  62%    0.6452  0.138        1.68%
Deep Rybka 4.1           68%  40%  64%    0.625   0.136        2.41%
Chiron 1.5               52%  42%  96%    0.4375  0.1092       0.87%
Deep Fritz 13 32b        51%  40%  98%    0.4082  0.102        -0.07%
Naum 4.2                 50%  42%  100%   0.42    0.105        0.63%
HIARCS 14 WCSC 32b       48%  40%  96%    0.4167  0.104        -1.06%
Hannibal 1.2             45%  40%  90%    0.4444  0.11         -0.04%
Gull 1.2                 45%  39%  90%    0.4333  0.1073       -1.93%
Deep Shredder 12         45%  40%  90%    0.4444  0.11         -0.08%
Deep Sjeng c't 2010 32b  43%  41%  86%    0.4767  0.1169       0.45%
Spike 1.4 32b            42%  40%  84%    0.4762  0.116        0.24%
spark-1.0                41%  39%  82%    0.4756  0.1151       -0.03%
Protector 1.4.0          39%  39%  78%    0.5     0.119        -0.16%
Deep Junior 13.3         39%  34%  78%    0.4359  0.1037       -5.21%
Quazar 0.4               36%  37%  72%    0.5139  0.1184       -1.07%
Zappa Mexico II          32%  35%  64%    0.5469  0.119        0.00%
MinkoChess 1.3           31%  36%  62%    0.5806  0.1242       0.69%
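The post does not define D_max and k, so the following is only my reading of the table rows: D_max = 1 - 2*|µ - 0.5| (the most draws possible at a given score, since score = wins + draws/2 and neither wins nor losses can be negative) and k = D / D_max. A small sketch under those assumptions, with a hypothetical function name, reproduces the Houdini and Naum rows:

```python
def draw_stats(mu, d):
    """Return (D_max, k, k*mu*(1-mu)) for score fraction mu and draw fraction d.

    Assumed definitions, inferred from the table rather than stated in the post:
      D_max = 1 - 2*|mu - 0.5|   (largest draw rate compatible with score mu)
      k     = D / D_max          (share of the possible draws actually drawn)
    """
    d_max = 1.0 - 2.0 * abs(mu - 0.5)
    if d_max <= 0.0:
        # A 0% or 100% score leaves no room for draws, so k is undefined.
        raise ValueError("k is undefined when the score is 0% or 100%")
    k = d / d_max
    return d_max, k, k * mu * (1.0 - mu)


if __name__ == "__main__":
    for name, mu, d in [("Houdini 3 STD", 0.82, 0.24), ("Naum 4.2", 0.50, 0.42)]:
        d_max, k, scaled = draw_stats(mu, d)
        print(f"{name:15s} D_max={d_max:.2f} k={k:.4f} k*mu*(1-mu)={scaled:.4f}")
```

This also shows why the raw draw rate is misleading for lopsided results: at an 82% score, no more than 36% of the games can be draws.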
Re: Draw aversion
Don wrote: With a little experimentation you can usually figure out what the contempt factor of a program is simply by setting up positions where it can force a draw and should.
It would be useful if someone could collate such info, and/or give positions for such testing (and any results). My recollection is that the contempt in Rybka also has an effect on various evaluation scorings, such as asymmetric pawn structures.