Houdini Is Top Rated Chess Engine

Adam Hair · Post by **Adam Hair** » Thu Dec 23, 2010 6:49 am

BB+ wrote:
I never realized that Graham was a Rybka beta tester. Wow.
I too was wondering about the reference here, as it seemed to me that GB was the TalkChess mod to whom he referred, while I too thought it unlikely he was a Rybka beta tester.

There is more uniformity of testing conditions intra-league as opposed to inter-league.
It is precisely this contention with which I disagree. Firstly, doing Crafty benchmarks to normalise 40/40 on a given machine is just not too precise. A given engine might be 5-10% slower or faster (relatively) due to setup issues, particularly with memory speed and/or caching. Furthermore, every tester can choose a different book, even on a match/tournament basis. These two aspects are fairly large sources of non-uniformity, even for one tester who has multiple non-identical machines and/or plays different tournaments with different books. More minor aspects of non-uniformity would be choice of TB usage and possibly GUI draw/resign rules (if applicable). In contrast, SSDF uniformised its conditions almost completely. I would like to see some actual evidence (or at least a compelling argument) that lumping together all "blitz" time controls from the alphabet of agencies is any worse than what is already extant from combining quasi-comparable data under conditions deemed equivalent.

Without a doubt there is a large amount of non-uniformity within, let's say, the CCRL. Of course, the use of a single
benchmark to normalise game time controls is not very accurate, not to mention that the time control computed
from the benchmark must be adjusted to match the time controls that are available with a given GUI. Yet the
time controls used by the more prominent rating lists differ a great deal more ( 40/10', 5'3" incremental, game in 10',
all with faster hardware than the CCRL norm ). The different books used within the CCRL could be a large source of non-uniformity, but the ultimate source of those books is the same: human grandmaster games. I am not sure whether that lessens the non-uniformity or not. IPON ( the use of 50 start positions ) and the CEGT ( a considerable number of games in their database start from Nunn and Noomen positions ) differ markedly from the CCRL in this respect.

Despite what I wrote above, you very well may be right. The relative rankings of the engines common to the more
prominent rating lists do not appear to differ much. Engine strength appears to be invariant, at least to some degree,
in relation to test conditions. Combining the results from various lists, as Vincent Lejeune did at one point, may in fact
be no less "accurate" than any single list.

I much appreciate CCRL, but feel that its focus has been mis-oriented, or at least mis-understood. For some reason, others have pointed to it as the "gold standard" of ratings lists, but the thing I find useful about it is that it canvasses so many engines.

The focus is misunderstood. Its greatest attribute is what you like about it; it canvasses ( or at least tries to ) many
engines. To accomplish that, resources from several people are pooled together. There lies the greatest weakness
of the CCRL, in terms of being a ratings list. The numerical ratings must be taken with a grain or two of salt.
But, as a list that informs engine authors and the curious of ( or helps them verify ) the relative strength of an engine, it
does a pretty good job. That is why I joined. If everybody truly understood this, they would see that it does not actually
matter whether or not the CCRL tests certain engines. IPON and SWCR would do a better job of finding the difference
in Elo between Houdini 1.5 and Rybka 4 ( single testers focused on rating the top engines ).

BB+ · Post by **BB+** » Thu Dec 23, 2010 10:58 pm

I was starting to doubt the size of error from calibration, until I actually tested it.
3314.640 MHz Phenom II X6 1055T (with popcnt):
Rybka 3 did 66193 "nps" on "go depth 15" (29.373s); IvanHoe did 2018000 nps on "benchmark".
3.003 GHz Intel Extreme X9650 (no popcnt):
Rybka 3 did 57824 nps, 12.6% slower; IvanHoe did 1826000 nps, only 9.5% slower, even with no popcnt.
2592.612 MHz Opteron w/o popcnt (852):
Rybka 3 at 44827 nps, 32.3% slower; IvanHoe at 1261000 nps, 37.5% slower.
2293.891 MHz Opteron with popcnt:
IvanHoe at 1260000 nps; Rybka 3 at 42046 nps.

So benchmarking really doesn't do that well. The ratios can differ by as much as 10% on this small sample (I had figured 5% or less would be more typical, but I guess Intel/AMD differences are more than I had thought).

IvanHoe/Rybka3 ratio
Extreme (no popcnt) 31.58
Phenom (popcnt)     30.49
Opteron (popcnt)    29.96
Opteron (no popcnt) 28.13
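
For anyone who wants to redo the arithmetic, here is a minimal Python sketch that recomputes the IvanHoe/Rybka 3 ratios from the nps figures quoted above. The dictionary layout and the function names are illustrative only, and measuring the spread relative to the largest ratio is just one possible convention (an assumption, not something stated in the post).

```python
# nps figures copied from the benchmark runs quoted above.
# The dict layout and key names are illustrative, not from any tool.
NPS = {
    "Extreme (no popcnt)": {"ivanhoe": 1826000, "rybka3": 57824},
    "Phenom (popcnt)": {"ivanhoe": 2018000, "rybka3": 66193},
    "Opteron (popcnt)": {"ivanhoe": 1260000, "rybka3": 42046},
    "Opteron (no popcnt)": {"ivanhoe": 1261000, "rybka3": 44827},
}


def speed_ratios(nps):
    """Return the IvanHoe/Rybka 3 nps ratio for each machine."""
    return {machine: v["ivanhoe"] / v["rybka3"] for machine, v in nps.items()}


def spread_percent(ratios):
    """Gap between largest and smallest ratio, as a percentage of the largest.

    Assumption: other conventions (e.g. relative to the smallest ratio)
    would give a somewhat different percentage.
    """
    lo, hi = min(ratios.values()), max(ratios.values())
    return 100.0 * (hi - lo) / hi


ratios = speed_ratios(NPS)
for machine, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{machine:<20} {ratio:6.2f}")
print(f"spread: {spread_percent(ratios):.1f}%")
```

If one engine's benchmark were a perfect proxy for the other's speed, all four ratios would be equal; the spread is a rough measure of how far off a single-benchmark calibration can be on these machines.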

As for opening books, the major factor I thought of was draw ratio. For instance, a 9:5 win ratio with 50% draws is a 50 Elo difference, and with 60% draws the same win ratio is 40 Elo. I also see various "Engine X does best with Book Y" posts now and then, but I've never determined the statistical significance therein.
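
The draw-ratio figures can be checked with the usual logistic Elo formula, Elo = 400 * log10(s / (1 - s)), where s is the expected score. The sketch below assumes that formula (the post does not say which one was used), and the function names are hypothetical.

```python
import math


def elo_from_score(score):
    """Elo difference implied by an expected score in (0, 1), logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))


def score_from_win_ratio(wins, losses, draw_fraction):
    """Expected score when decisive games split wins:losses and the rest are draws."""
    decisive = 1.0 - draw_fraction
    win_fraction = decisive * wins / (wins + losses)
    return win_fraction + 0.5 * draw_fraction


for draws in (0.5, 0.6):
    s = score_from_win_ratio(9, 5, draws)
    print(f"{draws:.0%} draws: score {s:.2%}, Elo {elo_from_score(s):.1f}")
```

Under these assumptions the 9:5 win ratio comes out at roughly 50 Elo with 50% draws and roughly 40 Elo with 60% draws, matching the numbers in the post.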

My main contention is that the various time controls really don't matter much. Is 40/4 and ponder off really that different from 5m+3s with ponder on, or gaard's "go movetime 4000" tests? The time management parameters might change, but I wouldn't expect drastic differentials. I would guess 5 Elo would be typical, 10 Elo notable, and perhaps 20 Elo in rare cases. I don't see these as greatly exceeding the errors from the other (intra-agency) sources.

If everybody truly understood this, they would see that it does not actually matter whether or not the CCRL tests certain engines. IPON and SWCR would do a better job of finding the difference in Elo between Houdini 1.5 and Rybka 4 ( single testers focused on rating the top engines ).

Indeed.

BB+ · Post by **BB+** » Thu Dec 23, 2010 11:02 pm

I was wondering about the whole free copy thing when Jeremy mentioned it as a (possible) perk, though I guess Thoresen did get an early copy of Komodo beta recently.

Adam Hair · Post by **Adam Hair** » Fri Dec 24, 2010 7:59 pm

BB+ wrote: I was starting to doubt the size of error from calibration, until I actually tested it.
3314.640 MHz Phenom II X6 1055T (with popcnt):
Rybka 3 did 66193 "nps" on "go depth 15" (29.373s); IvanHoe did 2018000 nps on "benchmark".
3.003 GHz Intel Extreme X9650 (no popcnt):
Rybka 3 did 57824 nps, 12.6% slower; IvanHoe did 1826000 nps, only 9.5% slower, even with no popcnt.
2592.612 MHz Opteron w/o popcnt (852):
Rybka 3 at 44827 nps, 32.3% slower; IvanHoe at 1261000 nps, 37.5% slower.
2293.891 MHz Opteron with popcnt:
IvanHoe at 1260000 nps; Rybka 3 at 42046 nps.

So benchmarking really doesn't do that well. The ratios can differ by as much as 10% on this small sample (I had figured 5% or less would be more typical, but I guess Intel/AMD differences are more than I had thought).
IvanHoe/Rybka3 ratio
Extreme (no popcnt) 31.58
Phenom (popcnt)     30.49
Opteron (popcnt)    29.96
Opteron (no popcnt) 28.13

I am not surprised by this. I have seen similar results.

As for opening books, the major factor I thought of was draw ratio. For instance, a 9:5 win ratio with 50% draws is a 50 Elo difference, and with 60% draws the same win ratio is 40 Elo. I also see various "Engine X does best with Book Y" posts now and then, but I've never determined the statistical significance therein.

I agree. Also, bias has some effect. Playing positions from both sides removes the bias, but at the cost of lowering
the confidence in the Elo estimation. Also, the draw rate and the bias change depending on the average time
per move, to some extent. Or, perhaps more properly stated, they depend on the average depth reached. This I know from comparing results for individual start positions at 40/4 and 40/40 time controls. How much these two things change
from book/opening suite to book/opening suite has not been measured, as far as I know.

My main contention is that the various time controls really don't matter much. Is 40/4 and ponder off really that different from 5m+3s with ponder on, or gaard's "go movetime 4000" tests? The time management parameters might change, but I wouldn't expect drastic differentials. I would guess 5 Elo would be typical, 10 Elo notable, and perhaps 20 Elo in rare cases. I don't see these as greatly exceeding the errors from the other (intra-agency) sources.

It would seem, disregarding how time management is implemented in the engines being tested, that the time controls
don't matter much. Of course, this implies that combining results from different computer systems is no great sin
either.

I guess my point all along has been this:

In the attempt to determine the actual Elo differences between engines, all conditions could be standardized, and
appropriate measures taken to ensure those conditions do not vary outside of some predetermined tolerances.
My question would be, "How relevant would those results be?". If we changed the conditions, with the same attention
given to reduce the possible errors, the Elo differences measured would likely be different. Given that, would not
finding the relative rankings make more sense than measuring Elo differences? Determining which engine is stronger
is a much more tractable problem than determining how much stronger one engine is compared to another engine.

When the CCRL is criticized ( as well as the other rating lists ) for lack of accuracy, the criticism is correct but is
also irrelevant. To be honest, given all of the sources of variability, I would not trust any sort of testing that claimed
to give Elo differences at a 95% confidence level. At least not without this disclaimer: "under these testing conditions".

Sentinel · Post by **Sentinel** » Tue Dec 28, 2010 4:37 am

Adam Hair wrote: I agree. Also, bias has some effect. Playing positions from both sides removes the bias, but at the cost of lowering the confidence in the Elo estimation.

No, it doesn't. In addition to affecting the confidence estimate, it also reduces the real Elo difference between the engines.

Here is an example to make it concrete. Suppose you have a book in which 10% of the positions are "biased", i.e. both engines win them.
Let's suppose the result of the match was 40+/10-/50=, so a 50% draw ratio. That is a 65% score, or a 108 Elo difference.
Now if you remove the 10% of biased positions, the real unbiased result would be 35+/5-/50=, which is 66.7% or a 121 Elo difference.
So you would get a mean error of 13 Elo!!!
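
The example above can be reproduced in a few lines of Python. This is only a sketch: it assumes the standard logistic Elo formula, and the helper name `elo_from_wdl` is made up for illustration.

```python
import math


def elo_from_wdl(wins, losses, draws):
    """Return (score, Elo difference) for a win/loss/draw record, logistic model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return score, 400.0 * math.log10(score / (1.0 - score))


# Match including the "biased" positions: 40 wins, 10 losses, 50 draws.
biased_score, biased_elo = elo_from_wdl(40, 10, 50)
# Same match with 5 wins and 5 losses from the biased positions removed.
clean_score, clean_elo = elo_from_wdl(35, 5, 50)

print(f"with biased positions:    {biased_score:.1%}, {biased_elo:.1f} Elo")
print(f"without biased positions: {clean_score:.1%}, {clean_elo:.1f} Elo")
print(f"difference: {clean_elo - biased_elo:.1f} Elo")
```

With unrounded scores the second figure comes out slightly above 120 rather than 121 (the 121 presumably comes from rounding the score to 66.7% first), but the gap is still about 13 Elo.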

kingliveson · Post by **kingliveson** » Mon Jan 10, 2011 3:37 am

Houdini could not pull off this trick without tablebases though it was clearly a win:

[Event "Houdini vs IvanHoe, Blitz 15m+10s"] 
[Site "W0111D0001"] 
[Date "2011.01.09"] 
[Round "10"] 
[White "IvanHoe 0B.01.09 x64"] 
[Black "Houdini 1.5 x64"] 
[Result "1/2-1/2"] 
[ECO "B12"] 
[Annotator "0.37;0.06"] 
[PlyCount "182"] 
[TimeControl "900+10"] 
 
{AMD Phenom(tm) II X4 940 Processor 3010 MHz W=22.2 plies; 6,000kN/s; 98,312, 
594 TBAs B=23.9 plies; 7,237kN/s} 1. e4 c6 2. d4 d5 3. e5 Bf5 4. Nf3 e6 {Both 
last book move} 5. Nc3 {0.37/22 81} Nd7 {0.06/21 34 (Bb4)} 6. Be2 {0.25/21 142 
(a3)} Bb4 {0.00/22 23} 7. O-O {0.25/22 21} h6 {0.01/22 84 (Ne7)} 8. Bd3 {0.30/ 
22 32} Ne7 {0.07/22 31} 9. Ne2 {0.29/21 22} Bxd3 {0.05/22 42 (Ba5)} 10. Qxd3 { 
0.33/22 23} O-O {0.10/22 23 (Ng6)} 11. c3 {0.31/21 58} Ba5 {0.10/20 0} 12. b4 { 
0.29/22 135} Bc7 {0.00/21 34} 13. a4 {0.30/21 6} Qe8 {-0.04/21 20 (Nb6)} 14. 
Re1 {0.22/21 32 (b5)} Ng6 {0.07/21 22} 15. c4 {0.14/20 39 (h4)} f6 {-0.02/20 45 
} 16. exf6 {0.11/21 94} Nxf6 {-0.05/21 69} 17. Ng3 {0.11/21 13} a6 {-0.08/20 
68 (Rc8)} 18. c5 {0.07/20 37} Nd7 {-0.16/20 27} 19. Bb2 {0.00/20 28} Qf7 {-0. 
22/21 18} 20. Re3 {0.00/21 19} Rae8 {-0.19/21 65} 21. Rae1 {0.00/21 11} Nf6 { 
-0.14/21 18} 22. h3 {0.00/22 66 (Qf1)} Bxg3 {-0.12/20 18} 23. fxg3 {0.00/20 0} 
Ne4 {-0.10/22 18} 24. Kh2 {0.00/21 11} Qc7 {-0.10/21 16 (h5)} 25. Rxe4 {0.06/ 
19 0} dxe4 {-0.11/20 0} 26. Rxe4 {0.00/21 4} Ne7 {-0.07/23 97} 27. Rxe6 {0.00/ 
22 7 (d5)} Qd7 {-0.13/24 18} 28. Re5 {0.00/22 4 (Re1)} Nd5 {-0.16/23 15} 29. 
Qc4 {0.00/22 7} Qf7 {-0.16/23 22 (Rxe5)} 30. Bc1 {0.00/24 20} Re6 {-0.16/23 15 
(Rxe5)} 31. Bd2 {0.00/23 13} Rxe5 {-0.16/23 23} 32. dxe5 {0.00/21 0} Nc7 {-0. 
16/23 20 (Qe6)} 33. Qg4 {0.00/24 14 (Qc2)} Qe6 {-0.22/25 20} 34. Qe4 {-0.01/23 
8} Qd5 {-0.22/25 15} 35. Qc2 {-0.01/22 3} Ne6 {-0.22/24 14} 36. Be3 {-0.01/23 
16} Rd8 {-0.23/24 21} 37. Qf5 {-0.05/22 5} Nf8 {-0.22/24 14} 38. Bf2 {-0.11/22 
53 (Bg1)} Qc4 {-0.27/23 32} 39. a5 {-0.11/23 25 (Nd4)} Rd7 {-0.37/23 19} 40. 
Qf4 {-0.29/23 33} Qxf4 {-0.46/25 24} 41. gxf4 {-0.29/21 0} Ng6 {-0.46/27 66} 
42. f5 {-0.48/22 5} Ne7 {-0.63/25 34} 43. f6 {-0.63/23 14} gxf6 {-0.63/23 0} 
44. exf6 {-0.63/22 1} Ng6 {-0.66/26 42} 45. Be3 {-0.62/23 3} h5 {-0.66/24 0} 
46. Bd4 {-0.70/24 17} Nh4 {-0.63/26 12 (Rd5)} 47. Ng5 {-0.80/26 42} Nf5 {-0.64/ 
26 18 (Rd5)} 48. Be5 {-0.77/23 14} Rd1 {-0.65/25 17 (Rd3)} 49. g3 {-0.85/23 10} 
Nh6 {-0.59/24 25 (Re1)} 50. Bf4 {-0.83/24 17} Rb1 {-0.60/27 14 (Re1)} 51. Nf3 { 
-0.94/23 20} Rb2+ {-0.58/26 32} 52. Kg1 {-0.94/23 16} Nf5 {-0.58/27 20} 53. Ng5 
{-0.93/22 4 (Ne5)} Rxb4 {-0.82/23 11} 54. Kf2 {-0.94/22 13} Nh6 {-1.00/23 21 
(Rb3)} 55. Ne6 {-1.23/23 29 (Kf3)} Kf7 {-1.02/26 10} 56. Nd8+ {-1.23/21 0} Kg6 
{-1.02/24 0} 57. Bxh6 {-1.36/24 8} Kxh6 {-1.01/27 18} 58. Kf3 {-1.60/24 6} Kg6 
{-0.99/27 16} 59. f7 {-1.60/22 0} Kg7 {-0.99/28 12} 60. g4 {-1.60/23 9} hxg4+ { 
-0.99/27 13} 61. hxg4 {-1.60/22 0} Rb1 {-1.00/27 10} 62. Ke2 {-1.56/23 7} Kf8 { 
-1.00/27 10} 63. Kd2 {-2.21/25 36} Rb4 {-1.00/27 16 (Rg1)} 64. Kd3 {-2.94/26 
31 (Kc3)} Kg7 {-1.03/26 11 (Rxg4)} 65. g5 {-2.79/25 28 (Kc3)} Rb1 {-1.59/26 14 
(Rg4)} 66. Ke2 {-2.90/25 19} Rg1 {-1.59/24 0} 67. Kd2 {-2.90/23 3} Rf1 {-1.58/ 
26 10 (Rg3)} 68. Nxb7 {-2.86/23 13} Rxf7 {-1.58/24 0} 69. Nd6 {-2.86/22 0} Rf3 
{-1.59/27 11 (Rf8)} 70. Ne8+ {-1.50/18 3} Kg6 {-1.59/28 16 (Kf7)} 71. Nc7 {0. 
00/18 4} Ra3 {-1.59/26 1} 72. Nxa6 {0.00/16 0} Rxa5 {-1.66/26 23} 73. Nb8 {0. 
00/21 2} Rxc5 {-1.66/24 0} 74. Ke3 {0.00/28 2} Kxg5 {-1.06/26 31 (Rc2)} 75. Kd4 
{0.00/0 0} Rd5+ {0.00/0 0} 76. Kc4 {0.00/0 0} Kf5 {0.00/0 0} 77. Nxc6 {0.00/0 0 
} Ke4 {0.00/0 0} 78. Kb3 {0.00/0 0} Rc5 {0.00/0 0} 79. Nb4 {0.00/0 0} Rc7 {0. 
00/0 0} 80. Ka2 {0.00/0 0} Kd4 {0.00/0 0} 81. Kb2 {0.00/0 0} Kc4 {0.00/0 0} 82. 
Na2 {0.00/0 0} Rd7 {0.00/0 0} 83. Kb1 {0.00/0 0} Rd2 {0.00/0 0} 84. Nc1 {0.00/ 
0 0} Kb4 {0.00/0 0} 85. Na2+ {0.00/0 0} Kc4 {0.00/0 0} 86. Nc1 {0.00/0 0} Rg2 { 
0.00/0 0} 87. Na2 {0.00/0 0} Rf2 {0.00/0 0} 88. Nc1 {0.00/0 0} Kc3 {0.00/0 0} 
89. Na2+ {0.00/0 0} Kc4 {0.00/0 0} 90. Nc1 {0.00/0 0} Kc3 {0.00/0 0} 91. Na2+ { 
0.00/0 0} Kc4 {0.00/0 0 Draw accepted} 1/2-1/2

ernest · Post by **ernest** » Mon Jan 10, 2011 7:15 pm

kingliveson wrote:Houdini could not pull off this trick without tablebases though it was clearly a win

At what moment do you think it was "clearly a win"?

kingliveson · Post by **kingliveson** » Mon Jan 10, 2011 8:27 pm

ernest wrote:
kingliveson wrote:Houdini could not pull off this trick without tablebases though it was clearly a win
At what moment do you think it was "clearly a win"?

Here are replays with IvanHoe (using 3-4-5-Z RobboBases) vs Houdini.

1.

[Event "Computer chess game"] 
[Site "W0111D0001"] 
[Date "2011.01.10"] 
[Round "?"] 
[White "Houdini 1.5 x64"] 
[Black "IvanHoe 0B.01.09 x64"] 
[Result "0-1"] 
[BlackElo "2800"] 
[ECO "8/8/8/"] 
[Opening "8/8/7r/5k2/7K w - - 24 101 "] 
[Variation "Black wins in 1"] 
[WhiteElo "2800"] 
[TimeControl "300+0"] 
[SetUp "1"] 
[FEN "3N4/1p6/p1p2Pkn/P1P4p/1r3B2/6PP/5K2/8 w - - 6 57"] 
[Termination "normal"] 
[PlyCount "88"] 
[WhiteType "program"] 
[BlackType "program"] 
 
57. Bxh6 Kxh6 58. Kf3 Kg6 59. f7 Kg7 60. Ke3 Rb3+ 61. Kf4 Rd3 62. Nxb7 Kxf7 
63. g4 Rxh3 64. Nd8+ Ke8 65. Nxc6 hxg4 66. Kxg4 Rc3 67. Kf4 Rxc5 68. Nd4 
Rxa5 69. Ke3 Ra3+ 70. Kd2 a5 71. Kc2 a4 72. Ne6 Rh3 73. Kb1 a3 74. Nf4 Rh2 
75. Nd5 Kd7 76. Ne3 Kc6 77. Ng4 a2+ 78. Ka1 Rh1+ 79. Kxa2 Kc5 80. Ne3 Kd4 
81. Nf5+ Kc3 82. Nd6 Rh5 83. Ne8 Rf5 84. Nd6 Re5 85. Nc8 Re6 86. Kb1 Kb3 
87. Kc1 Rc6+ 88. Kd2 Rxc8 89. Ke3 Rc4 90. Kd3 Rh4 91. Ke2 Rh3 92. Kf2 Rc3 
93. Kg2 Kc2 94. Kf1 Kd1 95. Kf2 Ra3 96. Kf1 Ra2 97. Kg1 Ke1 98. Kh1 Kf2 99. 
Kh2 Ra3 100. Kh1 Rh3# 0-1

2.

3.

[Event "Computer chess game"] 
[Site "W0111D0001"] 
[Date "2011.01.10"] 
[Round "?"] 
[White "Houdini 1.5 x64"] 
[Black "IvanHoe 0B.01.09 x64"] 
[Result "0-1"] 
[BlackElo "2800"] 
[ECO "8/8/8/"] 
[Opening "8/8/5k2/6qK/8 w - - 13 109 "] 
[Variation "Black wins in 1"] 
[WhiteElo "2800"] 
[TimeControl "300+0"] 
[SetUp "1"] 
[FEN "3N4/1p3Pk1/p1p5/P1P3P1/1r6/3K4/8/8 b - - 0 65"] 
[Termination "normal"] 
[PlyCount "87"] 
[WhiteType "program"] 
[BlackType "program"] 
 
65. ... Rb1 66. Ke2 Rg1 67. Kd2 Kf8 68. Kc3 Rd1 69. Nxb7 Kxf7 70. Nd6+ Kg6 
71. Ne4 Ra1 72. Kb4 Re1 73. Nf2 Kxg5 74. Nd3 Re4+ 75. Kc3 Re3 76. Kd4 Rxd3+ 
77. Kxd3 Kf4 78. Kd4 Kf5 79. Kc3 Ke4 80. Kc4 Ke5 81. Kb3 Kd4 82. Kb4 Kd5 
83. Kb3 Kxc5 84. Kc3 Kb5 85. Kd4 Kxa5 86. Kc5 Ka4 87. Kc4 a5 88. Kc3 Ka3 
89. Kc4 Kb2 90. Kc5 a4 91. Kd4 a3 92. Kc4 a2 93. Kd4 a1=Q 94. Ke5 c5 95. 
Ke6 c4 96. Kd5 c3 97. Ke4 c2 98. Kd5 c1=Q 99. Ke6 Qab1 100. Ke5 Qa2 101. 
Kf6 Qg5+ 102. Kxg5 Qe6 103. Kh5 Qg8 104. Kh4 Kc2 105. Kh3 Kd3 106. Kh4 Ke4 
107. Kh3 Kf3 108. Kh2 Qg2# 0-1

You can also check out BB+'s analysis here.

ernest · Post by **ernest** » Mon Jan 10, 2011 8:39 pm

Also at move 70, with FEN: 4N3/6k1/p1p5/P1P3P1/8/5r2/3K4/8 b - - 3 70

70... Kg6? (which was played by Houdini) only draws, while 70... Kf7! is the only move to win.

kingliveson · Post by **kingliveson** » Mon Jan 10, 2011 8:49 pm

ernest wrote:Also at move 70, with FEN: 4N3/6k1/p1p5/P1P3P1/8/5r2/3K4/8 b - - 3 70

70... Kg6? (which was played by Houdini) only draws, while 70... Kf7! is the only move to win.

And the worst part of it?

Houdini vs IvanHoe, Blitz 15m+10s  
                                
1   Houdini 1.5 x64        +95  0½1010½½1½11½111½½1½1½½½½½½1½½   19.0/30
2   IvanHoe 0B.01.09 x64   -95  1½0101½½0½00½000½½0½0½½½½½½0½½   11.0/30
