Re: Houdini routs Rybka to start, routs Rybka to end match
Posted: Fri Feb 11, 2011 2:24 am
by hyatt
kingliveson wrote:You could not get a fairer tournament: equal hardware, the same openings (reversed colors), equal time, etc.
There is no reason to believe that Rybka 4.0 would still have come out ahead of Houdini 1.5a with both engines using 200+ cores -- though it would be reasonable to say that Rybka has had a significant development head start in parallel search, given the resources available to test it.
Unfortunately, this is a classic "non-author" claim. As a simple counter-point: for Cray Blitz, and for Crafty today, we _know_ that there are some openings we don't play particularly well, for several reasons. We could stop and try to solve those, or we can simply avoid those kinds of positions for the moment and keep working. In my testing of Crafty, I don't pick and choose openings; I am trying to test/tune for "all" positions. But in a tournament, playing "all" positions can cost you 200 Elo or more. It would not be unexpected for a program outside the top ten on a rating list to win a WCCC event using a "single book", because superior opening choices led it into positions where it plays exceptionally well while avoiding those where it plays exceptionally badly.
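For scale, the standard Elo expectancy formula shows what a 200 Elo handicap means in practice. A minimal sketch (generic Elo arithmetic, nothing specific to Crafty or any particular tournament):

def expected_score(delta_elo):
    # Standard Elo expectancy: expected score for the side that is
    # delta_elo points stronger.
    return 1.0 / (1.0 + 10.0 ** (-delta_elo / 400.0))

print(round(expected_score(200), 2))  # 0.76 -- an even match becomes roughly 76-24

In other words, an opening handicap of that size would swamp almost any real difference in engine strength.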
Certainly, as a human chess player, I don't play every opening. I have favourites that I have studied and understand, and I play those when the games mean something, as in a tournament, rather than in 5-minute chess at a club meeting. If a human won't let you dictate which openings he must play, why is this OK for a computer event? But wait, humans _do_ let you dictate openings if you organize a "thematic tournament" -- yet everyone knows such an event is not a true indicator of overall chess skill. And what about a program with a new approach to pondering that splits its time across several candidate moves? Turning that off in a "no-ponder" tournament disables a new idea that might be worth a significant number of Elo points.
While we're at it, why not arbitrarily change the value of a queen to 8 for all programs? That would still be "equal", correct? The idea that it is OK to arbitrarily disable something as long as you disable it for everyone is fundamentally flawed. And that is why many criticize the approach.
Re: Houdini routs Rybka to start, routs Rybka to end match
Posted: Fri Feb 11, 2011 2:48 am
by kingliveson
hyatt wrote:kingliveson wrote:You could not get a fairer tournament: equal hardware, the same openings (reversed colors), equal time, etc.
There is no reason to believe that Rybka 4.0 would still have come out ahead of Houdini 1.5a with both engines using 200+ cores -- though it would be reasonable to say that Rybka has had a significant development head start in parallel search, given the resources available to test it.
Unfortunately, this is a classic "non-author" claim. As a simple counter-point: for Cray Blitz, and for Crafty today, we _know_ that there are some openings we don't play particularly well, for several reasons. We could stop and try to solve those, or we can simply avoid those kinds of positions for the moment and keep working. In my testing of Crafty, I don't pick and choose openings; I am trying to test/tune for "all" positions. But in a tournament, playing "all" positions can cost you 200 Elo or more. It would not be unexpected for a program outside the top ten on a rating list to win a WCCC event using a "single book", because superior opening choices led it into positions where it plays exceptionally well while avoiding those where it plays exceptionally badly.
I don't think you can look at it that way, because here we are dealing with AI and its ability to adapt, with both engines given the same legal chess starting positions. I see it as more or less a generic test of the programs: it simply asks what engine A or B can do given this set of conditions.
Certainly, as a human chess player, I don't play every opening. I have favourites that I have studied and understand, and I play those when the games mean something, as in a tournament, rather than in 5-minute chess at a club meeting. If a human won't let you dictate which openings he must play, why is this OK for a computer event? But wait, humans _do_ let you dictate openings if you organize a "thematic tournament" -- yet everyone knows such an event is not a true indicator of overall chess skill. And what about a program with a new approach to pondering that splits its time across several candidate moves? Turning that off in a "no-ponder" tournament disables a new idea that might be worth a significant number of Elo points.
These programs are capable of things no human is today, so you can't really make that comparison. In computer tournaments, as you mentioned, using an opening book is allowed, but that is not the case in human chess. Besides, it would be a great idea to set up a tournament in which starting positions are randomly selected and watch top GMs battle it out.
Those who argue that the test is as fair as possible for generic strength could point out that the engines are using default settings. For example, pondering is off by default under the UCI protocol. So we have to take the tournament for what it is, which is not an optimal configuration for either program.
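To illustrate (a minimal Python sketch; "./engine" is a placeholder path, but "setoption name Ponder value true" is the standard UCI command a GUI sends to opt in to pondering):

import subprocess

# Launch an engine and enable pondering over UCI; per the protocol,
# pondering stays off unless the GUI explicitly turns it on.
engine = subprocess.Popen(["./engine"], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, text=True)
engine.stdin.write("uci\n")
engine.stdin.flush()
for line in engine.stdout:          # engine lists its options, then "uciok"
    if line.strip() == "uciok":
        break
engine.stdin.write("setoption name Ponder value true\n")  # opt in to pondering
engine.stdin.write("isready\n")
engine.stdin.flush()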
While we're at it, why not arbitrarily change the value of a queen to 8 for all programs? That would still be "equal", correct? The idea that it is OK to arbitrarily disable something as long as you disable it for everyone is fundamentally flawed. And that is why many criticize the approach.
Re: Houdini routs Rybka to start, routs Rybka to end match
Posted: Fri Feb 11, 2011 8:49 am
by orgfert
kingliveson wrote:It simply asks what engine A or B can do given this set of conditions.
But the "set of conditions" are less optimal to differing degrees depending on the AI. One cannot call the results of such methods "a rating list of current programs" the implication being that the conditions are fair, because they are not fair. One might as well rate teams of footballers playing from wheelchairs instead of their legs.
kingliveson wrote:... it would be a great idea to set up a tournament in which starting positions are randomly selected and watch top GMs battle it out.
I agree that it might be, but a rating list based on it would not tell us what we really want to know.
kingliveson wrote:...pondering is off by default under the UCI protocol. So we have to take the tournament for what it is, which is not an optimal configuration for either program.
A rating list of arbitrarily non-optimally set-up programs yields nothing more significant than a sequence of artificially random numbers.
Re: Houdini routs Rybka to start, routs Rybka to end match
Posted: Fri Feb 11, 2011 9:44 am
by Jeremy Bernstein
orgfert wrote:A rating list of arbitrarily non-optimally set-up programs yields nothing more significant than a sequence of artificially random numbers.
That's a reductio ad absurdum. They just aren't testing what you think is important.
I agree with you, to a point -- engines should be able to showcase their strengths beyond the realm of search and evaluation. But it's decidedly unfair to allow Engine A to compete with 200 cores against Engine B on 8 and still expect the results to be meaningful. In the end, you don't know whether it was the engine or the hardware that won.
Play with its own opening book (if available)? Sure.
Play with ponder? Sure.
Play using the best software settings? Sure (assuming the developer can provide them, why not).
But it's obviously unfair to allow developers who can afford massive hardware to use it against developers forced to run on ordinary machines. Why is that so hard to swallow? "Magnus Carlsen doesn't leave half of his brain at home when he goes to Wijk aan Zee" is sophistry: Magnus Carlsen is not configurable software capable of running on a wide variety of hardware. I understand wanting to show your best stuff in a competition, but there should be limits, because playing against those who can afford a private or university cluster is highly disadvantageous to the developers who cannot.
From the perspective of engine authors, I can understand not wanting their engine to be reduced to a benchmarking tool. But I think that there are ways to accommodate this wish, without destroying the potential for reasonably equal chances for all contenders.
Jeremy
Re: Houdini routs Rybka to start, routs Rybka to end match
Posted: Fri Feb 11, 2011 2:59 pm
by BTO7
Trying to claim that somehow Rybka is not getting a fair shake is really funny. Houdini is beating Rybka: book, no book, short time control, long time control, ponder on, ponder off, and everywhere in between. So Rybka fans need to stop grasping at straws and face reality: Houdini is better all the way around, and nothing is unfair about the test. Go to any forum and you can see many other tests with other time controls, other hardware, other everything, and the results are still the same. Bottom line: Houdini is considerably stronger for 99% of computer owners. If Rybka's claim to fame is all based on her book, then it's really not the program, is it? Most don't own 200 cores, so to base anything on that is utterly ridiculous, imho.
BT
Re: Houdini routs Rybka to start, routs Rybka to end match
Posted: Fri Feb 11, 2011 4:07 pm
by kingliveson
Martin Thoresen wrote:Yes, Chessvibes put their article up today.
Best,
Martin
Rybka website before the article: [screenshot]
Rybka website after the article: [screenshot]
Re: Houdini routs Rybka to start, routs Rybka to end match
Posted: Fri Feb 11, 2011 4:59 pm
by Prima
First it was the false "clone claims" without facts. Then censoring of the truth in their forum. And now it's the game of deleting the sites and rating lists that expose the truth?! The Rybka forum moderators' childishness never ceases to amaze me.
Thanks Martin Thoresen for your work. Much appreciated. Please keep up the good, honest work.
Re: Houdini routs Rybka to start, routs Rybka to end match
Posted: Fri Feb 11, 2011 6:26 pm
by Martin Thoresen
Prima wrote:First it was the false "clone claims" without facts. Then censoring of the truth in their forum. And now it's the game of deleting the sites and rating lists that expose the truth?! The Rybka forum moderators' childishness never ceases to amaze me.
Thanks Martin Thoresen for your work. Much appreciated. Please keep up the good, honest work.
Thank you very much.
Division F kicks off in just about 30 minutes.
Best,
Martin
Re: Houdini routs Rybka to start, routs Rybka to end match
Posted: Fri Feb 11, 2011 8:39 pm
by orgfert
Jeremy Bernstein wrote:orgfert wrote:A rating list of arbitrarily non-optimally set-up programs yields nothing more significant than a sequence of artificially random numbers.
That's a reductio ad absurdum. They just aren't testing what you think is important.
The reductio ad absurdum is the arbitrary dumbing-down of some AIs with respect to others. Testing that way is of no intellectual benefit.
Jeremy Bernstein wrote:I agree with you, to a point -- engines should be able to showcase their strengths beyond the realm of search and evaluation. But it's decidedly unfair to allow Engine A to compete with 200 cores against Engine B on 8 and expect the results to be significant, either.
Fairness isn't an issue as long as both sides play by the rules. It is the disparity in strength that is of interest. If you are going to effectively hobble more advanced AIs, you might as well level the field with pawn odds and call that a rating list.
Jeremy Bernstein wrote:In the end, you don't know whether it was the engine or the hardware that won.
I reiterate the issue of forcing Cray Blitz to run on a PDP-11 along with all the other AIs of that era. The notion of rating AI skill mainly, mostly, and nearly exclusively by that technique is, from an objective standpoint, perfectly blithering.
Jeremy Bernstein wrote:But it's obviously unfair to allow developers who can afford massive hardware to use it against developers forced to run on ordinary machines. Why is that so hard to swallow?
... I understand wanting to show your best stuff in a competition, but there should be limits, because it's highly disadvantageous to those developers who cannot afford a private or university cluster to play against those who can.
What can be afforded isn't interesting and doesn't matter if things are done properly. A FICS-style rating server, as suggested above, could provide real ratings on diverse hardware via configuration accounts such as Sjeng-200cpu, Sjeng-1cpu, crafty23.4-16cpu, crafty23.4-2cpu, houdini2.0-2cpu, etc. No crippleware, no arbitrary settings: real games between design-intended setups. Who cares if program A's author couldn't afford a cluster? It doesn't matter when Sjeng-1cpu can be compared to program-A-1cpu, and both can be compared to Sjeng-200cpu, because they are all in the same rating pool on the same server.
THAT would be a rating list.
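A minimal sketch of how such a shared pool might work (account names, starting ratings, and K-factor are illustrative; the update rule is just standard Elo):

K = 16  # illustrative K-factor

ratings = {
    "Sjeng-200cpu":     2900.0,
    "Sjeng-1cpu":       2700.0,
    "crafty23.4-16cpu": 2850.0,
    "houdini2.0-2cpu":  3000.0,
}

def update(a, b, score_a):
    # Standard Elo update after one game; score_a is 1, 0.5, or 0 for a.
    expected_a = 1.0 / (1.0 + 10.0 ** ((ratings[b] - ratings[a]) / 400.0))
    delta = K * (score_a - expected_a)
    ratings[a] += delta
    ratings[b] -= delta

# Any two accounts can meet, so every configuration -- cluster or
# single core -- ends up directly comparable within one pool.
update("houdini2.0-2cpu", "Sjeng-1cpu", 1.0)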
Re: Houdini routs Rybka to start, routs Rybka to end match
Posted: Fri Feb 11, 2011 8:50 pm
by Martin Thoresen
orgfert, please read the information page.
http://www.tcec-chess.org/info.php
The goal of TCEC is to provide the viewers with a live broadcast of quality chess, played strictly between computer chess engines created by different programmers.
An important point to remember is that TCEC is by no means a "rating list" for computer chess engines that shows how strong engine X is relative to engine Y or Z after letting them play hundreds or even thousands of games. For that we have the excellent SWCR, IPON, CEGT and CCRL.
I think your posts don't make sense at all. Dumbed-down Rybka?
Your analogy is like saying that a 300 hp car driven legally at 100 km/h on the highway is dumbed down because it could theoretically do 250 km/h.
Best,
Martin