orgfert wrote:
I don't see the relevance of testing programs that have been stripped to the least common denominator. I thought the point would be to discover the relative strengths of competing designs. Listing the strength of a lobotomized design looks meaningless.
You make it sound as if there is not very much left after turning off books, ponder, and learning. I believe many
authors would disagree with you.
orgfert wrote:
So you will admit that the few that use these techniques are put at a disadvantage? This would invalidate their standings in the lists, yes?
Put at a disadvantage? No.
Invalidate their standing? No.
I would think some of the extras are there in order to make the chess engine's play more interesting, not because
they would help it place higher on any rating list.
orgfert wrote:
Not to mention that the proliferation of these methods gives programmers no incentive to design outside the boundaries of the testing methods, since they know anything else will be disabled by the list makers. And there seems to be a tacit opinion inherent in the testing process that chess on computers should not be treated as AI, since its treatment as "intelligent behavior" is subject to surgery that dumbs designs down to a common level. We would never do this to biological intelligence.
IOW, computer chess is not treated as AI by the masses; only the stripped-down design, the search itself, counts, with learning, full time management, self-tuned books, etc. tossed out, even though the testers are striving for a scientific process. But they aren't discovering the relative strengths of each AI design at all.
Do you really think that list makers determine what should be in a chess program? You give the whole group too
much credit. Some authors undoubtedly strive to climb the lists. Others pay more attention to giving their program
a full set of features.
Anybody who does not understand that computer chess is artificial intelligence needs to do some reading. Yet
simply testing for engine strength does not dismiss that connection. How do you think Bob Hyatt tests Crafty?
With books, ponder, and learning on? No. When he competes with Crafty, then yes. But when he wants to find
out whether some change in the code makes Crafty stronger, all of that is turned off. The same goes for other authors.
And the rating lists serve as a check for them.
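To make that concrete, here is a rough sketch of such a fixed-conditions test using the python-chess library: two builds of an engine play each other with pondering left off and the engine's own book disabled where it exposes the common "OwnBook" UCI option. The engine paths are placeholders, and this is only an illustration; authors and testers typically use dedicated match tools such as cutechess-cli for this at scale.

Code:
import chess
import chess.engine

# Placeholder paths to two builds of the same engine (assumptions).
OLD_BUILD = "./engine_old"
NEW_BUILD = "./engine_new"

def play_game(white_path, black_path, movetime=0.1):
    """Play one game under fixed conditions: no pondering, no own book."""
    engines = [chess.engine.SimpleEngine.popen_uci(white_path),
               chess.engine.SimpleEngine.popen_uci(black_path)]
    for eng in engines:
        # Disable the engine's own book if it exposes the common UCI option.
        if "OwnBook" in eng.options:
            eng.configure({"OwnBook": False})
    board = chess.Board()
    try:
        while not board.is_game_over():
            eng = engines[0] if board.turn == chess.WHITE else engines[1]
            # ponder defaults to False, so no pondering takes place.
            result = eng.play(board, chess.engine.Limit(time=movetime))
            board.push(result.move)
        return board.result()  # "1-0", "0-1", or "1/2-1/2"
    finally:
        for eng in engines:
            eng.quit()

if __name__ == "__main__":
    print(play_game(NEW_BUILD, OLD_BUILD))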
orgfert wrote:
Adam Hair wrote: 6) Each rating list is an attempt at something approaching a scientific measurement of engine strength. How close
the approach comes is open to opinion.

In each case, there is an attempt to eliminate sources of variation.
Sometimes there are trade-offs (more testers allow for more games and engines but create more statistical
noise), but at least there is some idea of each engine's strength (there are many more that should have been tested).
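As a rough illustration of that statistical noise, the 95% error margin on a measured Elo difference shrinks roughly with the square root of the number of games. A back-of-the-envelope calculation (treating games as independent and ignoring the draw correction):

Code:
import math

def elo_diff(score):
    """Convert a score fraction (0..1, exclusive) to an Elo difference."""
    return -400 * math.log10(1 / score - 1)

def elo_error_margin(score, games, z=1.96):
    """Approximate 95% error margin in Elo for a given score over n games."""
    se = math.sqrt(score * (1 - score) / games)   # standard error of the score
    return (elo_diff(min(score + z * se, 0.999)) -
            elo_diff(max(score - z * se, 0.001))) / 2

# Error margin at a 55% score: roughly 69 Elo after 100 games,
# 22 Elo after 1000 games, 7 Elo after 10000 games.
for n in (100, 1000, 10000):
    print(n, round(elo_error_margin(0.55, n), 1))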
orgfert wrote:
This approach fundamentally destroys many design elements of a computer chess player's strength. Even if you discover that specific elements tend to make little difference, you are blinding the test to potentially effective strategies when they arrive in newer, more innovative versions.
Therefore, testing should be careful to include all design elements in a system for evaluation, whether they are deemed to differentiate or not. This is a fundamental principle that should never be violated.
orgfert wrote:
Adam Hair wrote: This is done quite often in science: define what you are trying to measure, try to eliminate sources of variation, then measure it.
This is not a concern in chess player rating lists. It is strange to test a program in a way in which it will not be used by the consumer, much less in the way intended by its designer for real competitive play. The usefulness of the lists, while accepted by almost everyone, cannot be considered accurate with respect to each program's design. People are essentially looking at inaccurate results, with most apparently not realizing it.
We are not giving any program a UL listing. The fact is this: we are testing the chess engine, not the chess program.
orgfert wrote:
What about the design goal of the designer? How about testing the designs? CCRL-type testing looks exactly like taking race cars, removing their engines, and just testing the engines in a lab, as though nobody cares about the transmission, suspension, or aerodynamics. It's like saying that racing is all about engines.
Start testing all the bells and whistles yourself.
Adam Hair wrote: And there are a lot of engines out there, many being updated and new engines arriving each month. The
CCRL has been trying to test as many of them as possible. This goal may be at odds with what you would like to see done.
It has been helpful to others.
Are you also caught up in the notion that the CCRL is some kind of accreditation organization?
If we were, then our tests should include all design elements. Well, we are not and we do not pretend to be.
orgfert wrote:
I've no such notions. It might be one thing if this were but one of several kinds of lists. But for it to be the principal bellwether for chess AI, while very well intentioned, is nevertheless a Wrong Thing.
Wrong Thing: n. A design, action, or decision that is clearly incorrect or inappropriate. Often capitalized; always emphasized in speech as if capitalized. The opposite of the Right Thing; more generally, anything that is not the Right Thing. In cases where ‘the good is the enemy of the best’, the merely good — although good — is nevertheless the Wrong Thing. “In C, the default is for module-level declarations to be visible everywhere, rather than just within the module. This is clearly the Wrong Thing.”
You certainly feel strongly about this. However, the strength of your convictions does not determine whether you are
right or wrong about an issue.