Houdini Is Top Rated Chess Engine

Adam Hair
Posts: 104
Joined: Fri Jun 11, 2010 4:29 am
Real Name: Adam Hair

Re: Houdini Is Top Rated Chess Engine

Post by Adam Hair »

BB+ wrote:
I never realized that Graham was a Rybka beta tester. Wow.
I too was wondering about the reference here, as it seemed to me that GB was the TalkChess mod to whom he referred, while I too thought it unlikely he was a Rybka beta tester.
There is more uniformity of testing conditions intra-league as opposed to inter-league.
It is precisely this contention with which I disagree. Firstly, doing Crafty benchmarks to normalise 40/40 on a given machine is just not too precise. A given engine might be 5-10% slower or faster (relatively) due to setup issues, particularly with memory speed and/or caching. Furthermore, every tester can choose a different book, even on a match/tournament basis. These two aspects are fairly large sources of non-uniformity, even for one tester who has multiple non-identical machines and/or plays different tournaments with different books. More minor aspects of non-uniformity would be choice of TB usage and possibly GUI draw/resign rules (if applicable). In contrast, SSDF uniformised its conditions almost completely. I would like to see some actual evidence (or at least a compelling argument) that lumping together all "blitz" time controls from the alphabet of agencies is any worse than what is already extant from combining quasi-comparable data under conditions deemed equivalent.
Without a doubt there is a large amount of non-uniformity within, let's say, the CCRL. Of course, the use of a single
benchmark to normalise game time controls is not very accurate, not to mention that the time control computed
from the benchmark must be modified to match the time controls that are available with a given GUI. Yet, the
time controls used by the more prominent rating lists differ from one another a great deal more ( 40/10', 5'3" incremental, game in 10',
all with faster hardware than the CCRL norm ). The different books used within the CCRL could be a large source of non-uniformity, but the ultimate source of those books is the same: human grandmaster games. I am not sure if that lessens the non-uniformity or not. Two lists that differ markedly from the CCRL are IPON ( which uses 50 start positions ) and the CEGT ( a considerable number of games in their database start from Nunn and Noomen positions ).

Despite what I wrote above, you very well may be right. The relative rankings of the engines common to the more
prominent rating lists do not appear to differ much. Engine strength appears to be invariant, at least to some degree,
in relation to test conditions. Combining the results from various lists, as Vincent Lejeune did at one point, may in fact
be no less "accurate" than any single list.
I much appreciate CCRL, but feel that its focus has been mis-oriented, or at least mis-understood. For some reason, others have pointed to it as the "gold standard" of ratings lists, but the thing I find useful about it is that it canvasses so many engines.
The focus is misunderstood. Its greatest attribute is what you like about it; it canvasses ( or at least tries to ) many
engines. To accomplish that, resources from several people are pooled together. There lies the greatest weakness
of the CCRL, in terms of being a ratings list. The numerical ratings must be taken with a grain or two of salt.
But, as a list that informs engine authors and the curious of ( or helps them verify ) the relative strength of an engine, it
does a pretty good job. That is why I joined. If everybody truly understood this, they would see that it does not actually
matter whether or not the CCRL tests certain engines. IPON and SWCR would do a better job of finding the difference
in Elo between Houdini 1.5 and Rybka 4 ( single testers focused on rating the top engines ).
BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Houdini Is Top Rated Chess Engine

Post by BB+ »

I was starting to doubt the size of error from calibration, until I actually tested it.
3314.640 MHz Phenom II X6 1055T (with popcnt):
Rybka 3 did 66193 "nps" on "go depth 15" (29.373s); IvanHoe did 2018000 nps on "benchmark".
3.003 GHz Intel Extreme X9650 (no popcnt):
Rybka 3 did 57824 nps, 12.6% slower; IvanHoe did 1826000 nps, only 9.5% slower, even with no popcnt.
2592.612 MHz Opteron w/o popcnt (852):
Rybka 3 at 44827 nps, 32.3% slower; IvanHoe at 1261000 nps, 37.5% slower.
2293.891 MHz Opteron with popcnt:
Rybka 3 at 42046 nps; IvanHoe at 1260000 nps.

So benchmarking really doesn't do that well. The ratios can differ by as much as 10% on this small sample (I had figured 5% or less would be more typical, but I guess Intel/AMD differences are more than I had thought).

IvanHoe/Rybka3 ratio
Extreme (no popcnt) 31.58
Phenom (popcnt)     30.49
Opteron (popcnt)    29.96
Opteron (no popcnt) 28.13
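
For anyone who wants to recheck the table, here is a minimal Python sketch that recomputes the IvanHoe/Rybka 3 speed ratio per machine from the nps figures quoted above. The dictionary layout and the function names are only illustrative, and the "spread" is simply the largest ratio divided by the smallest, which is one of several ways the disagreement could be summarised.

```python
# nps figures copied from the measurements above: (Rybka 3 nps, IvanHoe nps)
MACHINES = {
    "Extreme (no popcnt)": (57824, 1826000),
    "Phenom (popcnt)": (66193, 2018000),
    "Opteron (popcnt)": (42046, 1260000),
    "Opteron (no popcnt)": (44827, 1261000),
}


def speed_ratios(machines):
    """Return {machine: IvanHoe nps / Rybka 3 nps}."""
    return {name: ivanhoe / rybka for name, (rybka, ivanhoe) in machines.items()}


def relative_spread(ratios):
    """Largest ratio divided by the smallest ratio, minus 1."""
    values = list(ratios.values())
    return max(values) / min(values) - 1.0


ratios = speed_ratios(MACHINES)
for name, ratio in ratios.items():
    print(f"{name:<20} {ratio:.2f}")
print(f"spread between extremes: {relative_spread(ratios):.1%}")
```

If a single benchmark engine were a perfect yardstick, every ratio would be identical and the spread would be zero.
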
As for opening books, the major factor I thought of was draw ratio. For instance, a 9:5 win ratio with 50% draws is a 50 Elo difference, and with 60% draws the same win ratio is 40 Elo. I also see various "Engine X does best with Book Y" posts from time to time, but I've never determined the statistical significance therein.
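
The draw-ratio arithmetic can be checked with a short Python sketch. It assumes the usual logistic Elo model (difference = 400 * log10(score / (1 - score))); the function names are made up for illustration.

```python
import math


def elo_from_score(score):
    """Elo difference implied by an expected score under the logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))


def elo_from_win_ratio(wins, losses, draw_rate):
    """Elo difference for a wins:losses ratio at a given draw rate."""
    decisive = 1.0 - draw_rate                     # share of games that are not drawn
    win_rate = decisive * wins / (wins + losses)   # share of all games that are won
    score = win_rate + 0.5 * draw_rate             # a draw counts as half a point
    return elo_from_score(score)


for draw_rate in (0.5, 0.6):
    print(f"draws {draw_rate:.0%}: {elo_from_win_ratio(9, 5, draw_rate):.1f} Elo")
```

The same 9:5 win ratio maps to a smaller Elo gap as the draw rate rises, because the decisive games make up a smaller share of the total score.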

My main contention is that the various time controls really don't matter much. Is 40/4 and ponder off really that different from 5m+3s with ponder on, or gaard's "go movetime 4000" tests? The time management parameters might change, but I wouldn't expect drastic differentials. I would guess 5 Elo would be typical, 10 Elo notable, and perhaps 20 Elo in rare cases. I don't see these as greatly exceeding the errors from the other (intra-agency) sources.
If everybody truly understood this, they would see that it does not actually matter whether or not the CCRL tests certain engines. IPON and SWCR would do a better job of finding the difference in Elo between Houdini 1.5 and Rybka 4 ( single testers focused on rating the top engines ).
Indeed.
BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Houdini Is Top Rated Chess Engine

Post by BB+ »

I was wondering about the whole free copy thing when Jeremy mentioned it as a (possible) perk, though I guess Thoresen did get an early copy of Komodo beta recently. :)
Adam Hair
Posts: 104
Joined: Fri Jun 11, 2010 4:29 am
Real Name: Adam Hair

Re: Houdini Is Top Rated Chess Engine

Post by Adam Hair »

BB+ wrote:I was starting to doubt the size of error from calibration, until I actually tested it.
3314.640 MHz Phenom II X6 1055T (with popcnt):
Rybka 3 did 66193 "nps" on "go depth 15" (29.373s); IvanHoe did 2018000 nps on "benchmark".
3.003 GHz Intel Extreme X9650 (no popcnt):
Rybka 3 did 57824 nps, 12.6% slower; IvanHoe did 1826000 nps, only 9.5% slower, even with no popcnt.
2592.612 MHz Opteron w/o popcnt (852):
Rybka 3 at 44827 nps, 32.3% slower; IvanHoe at 1261000 nps, 37.5% slower.
2293.891 MHz Opteron with popcnt:
Rybka 3 at 42046 nps; IvanHoe at 1260000 nps.

So benchmarking really doesn't do that well. The ratios can differ by as much as 10% on this small sample (I had figured 5% or less would be more typical, but I guess Intel/AMD differences are more than I had thought).

IvanHoe/Rybka3 ratio
Extreme (no popcnt) 31.58
Phenom (popcnt)     30.49
Opteron (popcnt)    29.96
Opteron (no popcnt) 28.13
I am not surprised by this. I have seen similar results.
As for opening books, the major factor I thought of was draw ratio. For instance, a 9:5 win ratio with 50% draws is a 50 Elo difference, and with 60% draws the same win ratio is 40 Elo. I also see various "Engine X does best with Book Y" posts from time to time, but I've never determined the statistical significance therein.
I agree. Also, bias has some effect. Playing positions from both sides removes the bias, but at the cost of lowering
the confidence in the Elo estimation. Also, the draw rate and the bias change depending on the average time
per move, to some extent. Or, perhaps more properly stated, they depend on the average depth reached. This I know from comparing results for individual start positions at 40/4 and 40/40 time controls. How much these two things change
from book/opening suite to book/opening suite has not been measured, that I know of.
My main contention is that the various time controls really don't matter much. Is 40/4 and ponder off really that different from 5m+3s with ponder on, or gaard's "go movetime 4000" tests? The time management parameters might change, but I wouldn't expect drastic differentials. I would guess 5 Elo would be typical, 10 Elo notable, and perhaps 20 Elo in rare cases. I don't see these as greatly exceeding the errors from the other (intra-agency) sources.
It would seem, disregarding how time management is implemented in the engines being tested, that the time controls
don't matter much. Of course, this implies that combining results from different computer systems is no great sin
either.

I guess my point all along has been this:

In the attempt to determine the actual Elo differences between engines, all conditions could be standardized, and
appropriate measures taken to ensure those conditions do not vary outside of some predetermined tolerances.
My question would be, "How relevant would those results be?". If we changed the conditions, with the same attention
given to reduce the possible errors, the Elo differences measured would likely be different. Given that, would not
finding the relative rankings make more sense than measuring Elo differences? Determining which engine is stronger
is a much more tractable problem than determining how much stronger one engine is compared to another engine.

When the CCRL is criticized ( as well as the other rating lists ) for lack of accuracy, the criticism is correct but is
also irrelevant. To be honest, given all of the sources of variability, I would not trust any sort of testing that claimed
to give Elo differences at a 95% confidence level. At least not without this disclaimer : Under these testing conditions.
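
To make the "95% confidence" point concrete, here is a rough Python sketch of how such an interval is commonly derived from a win/draw/loss count. It assumes independent games, a normal approximation for the mean score, and the logistic Elo model; the 350/400/250 result is a made-up example, not data from any rating list. The interval only captures sampling noise for one fixed set of conditions, which is exactly why the disclaimer matters.

```python
import math


def elo(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return 400.0 * math.log10(score / (1.0 - score))


def elo_interval(wins, draws, losses, z=1.96):
    """Return (low, estimate, high) Elo for a match result.

    z=1.96 corresponds to roughly 95% under the normal approximation.
    Assumes score +/- margin stays strictly between 0 and 1.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points) around the mean score.
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    margin = z * math.sqrt(variance / n)
    return elo(score - margin), elo(score), elo(score + margin)


# Hypothetical 1000-game match, for illustration only.
low, estimate, high = elo_interval(350, 400, 250)
print(f"{estimate:+.1f} Elo, interval [{low:+.1f}, {high:+.1f}] under these testing conditions")
```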
Sentinel
Posts: 122
Joined: Thu Jun 10, 2010 12:49 am
Real Name: Milos Stanisavljevic

Re: Houdini Is Top Rated Chess Engine

Post by Sentinel »

Adam Hair wrote:I agree. Also, bias has some effect. Playing positions from both sides removes the bias, but at the cost of lowering the confidence in the Elo estimation.
No, it doesn't. In addition to lowering the confidence, it also understates the real Elo difference between the engines.

Just an example so that you can see it. Say you have a book in which 10% of the positions are "biased", so that each engine wins them with the favoured side.
Let's suppose the result of the match was 40+/10-/50=, so a 50% draw ratio. That's 65%, or about a 108 Elo difference.
Now if you remove the 10% of biased positions, the real unbiased result would be 35+/5-/50=, which is 66.7%, or about a 120 Elo difference.
So you would get a mean error of roughly 13 Elo (107.5 versus 120.4 before rounding)!!!
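
The example can be reproduced with a few lines of Python. This is only a sketch, assuming the standard logistic Elo formula; the function name is illustrative.

```python
import math


def elo_from_wdl(wins, losses, draws):
    """Return (score fraction, Elo difference) for a win/loss/draw count."""
    score = (wins + 0.5 * draws) / (wins + losses + draws)
    return score, 400.0 * math.log10(score / (1.0 - score))


# Full match, including the 10 games from biased positions.
score_all, elo_all = elo_from_wdl(40, 10, 50)

# Same match with the biased games removed: 5 wins taken from each engine.
score_clean, elo_clean = elo_from_wdl(35, 5, 50)

print(f"with biased positions:    {score_all:.1%}  {elo_all:.1f} Elo")
print(f"without biased positions: {score_clean:.1%}  {elo_clean:.1f} Elo")
print(f"difference:               {elo_clean - elo_all:.1f} Elo")
```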
kingliveson
Posts: 1388
Joined: Thu Jun 10, 2010 1:22 am
Real Name: Franklin Titus
Location: 28°32'1"N 81°22'33"W

Re: Houdini Is Top Rated Chess Engine

Post by kingliveson »

Houdini could not pull off this trick without tablebases though it was clearly a win:

PAWN : Knight >> Bishop >> Rook >>Queen
ernest
Posts: 247
Joined: Thu Sep 02, 2010 10:33 am

Re: Houdini Is Top Rated Chess Engine

Post by ernest »

kingliveson wrote:Houdini could not pull off this trick without tablebases though it was clearly a win
At what moment do you think it was "clearly a win" :?:
kingliveson
Posts: 1388
Joined: Thu Jun 10, 2010 1:22 am
Real Name: Franklin Titus
Location: 28°32'1"N 81°22'33"W

Re: Houdini Is Top Rated Chess Engine

Post by kingliveson »

ernest wrote:
kingliveson wrote:Houdini could not pull off this trick without tablebases though it was clearly a win
At what moment do you think it was "clearly a win" :?:
Here are replays with IvanHoe (using 3-4-5-Z RobboBases) vs Houdini.

[Three embedded game replays, numbered 1 to 3.]

You can also check out BB+'s analysis here.
ernest
Posts: 247
Joined: Thu Sep 02, 2010 10:33 am

Re: Houdini Is Top Rated Chess Engine

Post by ernest »

Also at move 70, with FEN: 4N3/6k1/p1p5/P1P3P1/8/5r2/3K4/8 b - - 3 70

70... Kg6? (which was played by Houdini) only draws, while 70... Kf7! is the only move to win.
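
For readers without a GUI at hand, a small Python sketch can expand the FEN above into a text diagram. It only handles the piece-placement and side-to-move fields, does no validation, and the function name is just illustrative.

```python
def fen_to_rows(fen):
    """Expand the piece-placement field of a FEN into 8 strings of 8 squares.

    Upper case is White, lower case is Black, '.' is an empty square.
    Rows are returned from rank 8 down to rank 1.
    """
    placement = fen.split()[0]
    rows = []
    for rank in placement.split("/"):
        row = ""
        for ch in rank:
            # A digit means that many consecutive empty squares.
            row += "." * int(ch) if ch.isdigit() else ch
        rows.append(row)
    return rows


FEN = "4N3/6k1/p1p5/P1P3P1/8/5r2/3K4/8 b - - 3 70"
side_to_move = "Black" if FEN.split()[1] == "b" else "White"

for rank_number, row in zip(range(8, 0, -1), fen_to_rows(FEN)):
    print(rank_number, " ".join(row))
print("  a b c d e f g h")
print(f"{side_to_move} to move")
```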
kingliveson
Posts: 1388
Joined: Thu Jun 10, 2010 1:22 am
Real Name: Franklin Titus
Location: 28°32'1"N 81°22'33"W

Re: Houdini Is Top Rated Chess Engine

Post by kingliveson »

ernest wrote:Also at move 70, with FEN: 4N3/6k1/p1p5/P1P3P1/8/5r2/3K4/8 b - - 3 70

70... Kg6? (which was played by Houdini) only draws, while 70... Kf7! is the only move to win.
And the worst part of it?
Houdini vs IvanHoe, Blitz 15m+10s  
                                
1   Houdini 1.5 x64        +95  0½1010½½1½11½111½½1½1½½½½½½1½½   19.0/30
2   IvanHoe 0B.01.09 x64   -95  1½0101½½0½00½000½½0½0½½½½½½0½½   11.0/30
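
The score and the +95 in the table can be rechecked from the result string with a short Python sketch. It assumes the rating column is the plain logistic Elo performance difference for the match, which is a guess about how the GUI computed it; the names are illustrative.

```python
import math


def tally(results):
    """Count (wins, draws, losses) in a result string of '1', '½' and '0'."""
    return results.count("1"), results.count("½"), results.count("0")


# Houdini's row, copied from the table above.
HOUDINI = "0½1010½½1½11½111½½1½1½½½½½½1½½"

wins, draws, losses = tally(HOUDINI)
points = wins + 0.5 * draws
score = points / len(HOUDINI)
elo_diff = 400.0 * math.log10(score / (1.0 - score))

print(f"+{wins} ={draws} -{losses}  {points}/{len(HOUDINI)}  {elo_diff:+.0f} Elo")
```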