Engine Testing
Posted: Fri Jul 02, 2010 1:37 am
by JCoit
When testing an engine, is it best to test/tune new versions against old versions or is it better to test/tune against a variety of engines (which may include older versions of the engine being tested)?
It seems to me that testing and tuning for the purpose of increasing ELO should involve a variety of other engines because at the end of the day, it doesn't matter as much how an engine performs against itself as much as how it performs against other engines.
Re: Engine Testing
Posted: Fri Jul 02, 2010 1:57 am
by Uly
Against itself.
I think one has to separate improvement of Elo from improvement of strength. In the former you maximize the number that appears on the rating lists; in the latter you make sure that you are playing the best moves.
For instance, Rybka 3's default of contempt 15 was chosen to maximize Elo, even though it played worse moves than with contempt 0. Indeed, a few days before Rybka 4 was released, Stockfish topped her for a few days, since it did not fall for the "traps" that Rybka 3 C15 was setting. That wouldn't have happened with contempt 0, and in the end it would have had a higher Elo than contempt 15 (which was optimized for the opposition at the time of Rybka 3's release).
Optimizing your engine to beat weaker opposition will increase your Elo, but you may be weakening the engine. If it loses against a previous version, it's probably because it's playing worse moves.
Re: Engine Testing
Posted: Fri Jul 02, 2010 2:10 am
by hyatt
I completely disagree, from tons of experience. Testing A' against A (A' is a small change to A) produces results that are sometimes exaggerated since the only difference between the two programs is that one new feature, making their Elo _very_ close and requiring a ton of games to get the error bar down.
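To put rough numbers on the error-bar point, here is a minimal sketch (not anyone's actual test harness) that turns a win/draw/loss count into an Elo difference with an approximate 95% interval. It assumes the standard logistic Elo formula and a normal approximation to the score; the function names `elo_from_score` and `match_elo` are made up for illustration.

```python
import math

def elo_from_score(score):
    """Convert a score fraction in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def _clamp(score, eps=1e-6):
    """Keep the score strictly inside (0, 1) so the logarithm stays finite."""
    return min(max(score, eps), 1.0 - eps)

def match_elo(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for a match, using a normal approximation.

    z=1.96 gives roughly a 95% interval; this is an assumption of the sketch.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    margin = z * math.sqrt(var / n)
    return (elo_from_score(_clamp(score)),
            elo_from_score(_clamp(score - margin)),
            elo_from_score(_clamp(score + margin)))
```

The margin shrinks only with the square root of the number of games, so when A and A' are a few Elo apart the interval stays wider than the difference until the game count gets very large.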
I've done both types of testing on our cluster: A vs A', and A and A' vs a common gauntlet of opponents. I have encountered (and reported in CCC in recent years) cases where A' beats A, but is overall worse than A when compared to the gauntlet. I have encountered cases where A' looked worse, but was better against the gauntlet. It is easy enough to make a change and cause A' to beat A by a small margin. But against other opponents, that change might suddenly make you play in a way that your original version A could not take advantage of, but other programs can. It is a form of genetic inbreeding when you think about it.
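The gauntlet comparison can be sketched roughly like this, assuming each version's results are stored as a dict mapping opponent name to (wins, draws, losses). The helper names `gauntlet_score` and `compare` are hypothetical, and this only looks at raw score fractions; it says nothing about whether a difference is statistically significant.

```python
def gauntlet_score(results):
    """results: {opponent: (wins, draws, losses)} -> overall score fraction."""
    points = sum(w + 0.5 * d for w, d, _ in results.values())
    games = sum(w + d + l for w, d, l in results.values())
    return points / games

def compare(old, new):
    """Return (per-opponent score change, overall score change) of new over old.

    Both dicts are assumed to cover the same opponents.
    """
    per_opponent = {}
    for name in old:
        per_opponent[name] = (gauntlet_score({name: new[name]})
                              - gauntlet_score({name: old[name]}))
    overall = gauntlet_score(new) - gauntlet_score(old)
    return per_opponent, overall
```

Looking at the per-opponent changes next to the overall change makes it easier to spot a modification that helps against one style of opponent and hurts against another, which a head-to-head A vs A' match cannot show.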
Note that unless you are at the top of the rating list, you don't have to optimize your program against weaker opponents. There are some that are better. Stockfish is strong enough to be useful for testing by anyone, Rybka included.
Re: Engine Testing
Posted: Fri Jul 02, 2010 4:41 am
by zwegner
JCoit wrote:When testing an engine, is it best to test/tune new versions against old versions or is it better to test/tune against a variety of engines (which may include older versions of the engine being tested)?
It seems to me that testing and tuning for the purpose of increasing ELO should involve a variety of other engines because at the end of the day, it doesn't matter as much how an engine performs against itself as much as how it performs against other engines.
I think both methods suck.
Re: Engine Testing
Posted: Fri Jul 02, 2010 5:12 am
by hyatt
zwegner wrote:JCoit wrote:When testing an engine, is it best to test/tune new versions against old versions or is it better to test/tune against a variety of engines (which may include older versions of the engine being tested)?
It seems to me that testing and tuning for the purpose of increasing ELO should involve a variety of other engines because at the end of the day, it doesn't matter as much how an engine performs against itself as much as how it performs against other engines.
I think both methods suck.
If you knock what works, you have to explain a better approach.
"running a gauntlet" has brought us a _long_ way in a short time...
Re: Engine Testing
Posted: Fri Jul 02, 2010 5:53 am
by zwegner
hyatt wrote:zwegner wrote:JCoit wrote:When testing an engine, is it best to test/tune new versions against old versions or is it better to test/tune against a variety of engines (which may include older versions of the engine being tested)?
It seems to me that testing and tuning for the purpose of increasing ELO should involve a variety of other engines because at the end of the day, it doesn't matter as much how an engine performs against itself as much as how it performs against other engines.
I think both methods suck.
If you knock what works, you have to explain a better approach.
"running a gauntlet" has brought us a _long_ way in a short time...
Oh, don't get me wrong, I don't have any better way. I just hate all methods of testing.
They have certainly worked for me too, but not without their share of headaches.
Anything you do will be based on some pretty big statistical assumptions. One big one is that strength is a one-dimensional quantity, which is most definitely not true (and I have run into this many times during testing). In fact this whole thread is based on how to best ignore the non-linearity of strength.
Re: Engine Testing
Posted: Fri Jul 02, 2010 6:58 am
by Chan Rasjid
Are you people sure you haven't all got this testing thing wrong?
I have been using a very powerful way of testing that I am very certain about:
- play 2 games, black then white, at 1 min + 0 sec against a much stronger opponent and take the result - the changes are good if 2-0 and bad if 0-2.
An Elo gain of +30 after 30,000 games means the factor and/or changes should be discarded, since we are not looking at the right stuff.
BB+ mentioned that, likely, what Rybka introduced in computer chess is testing. Maybe it is about first locating the right ingredients, and then comes the (2-game) testing.
Rasjid
Re: Engine Testing
Posted: Fri Jul 02, 2010 7:13 am
by Richard Vida
JCoit wrote:When testing an engine, is it best to test/tune new versions against old versions or is it better to test/tune against a variety of engines (which may include older versions of the engine being tested)?
I use both ways. When I test changes in the search, I run a head-to-head match against the previous version. When the change seems good but only by a very tiny margin, I then run a gauntlet against a variety of opponents.
Changes in the eval I test only with the gauntlet.
Richard
Re: Engine Testing
Posted: Fri Jul 02, 2010 7:37 am
by Mincho Georgiev
I think running gauntlets is preferable to A vs A', but I don't run them either.
I run separate matches vs 3-4 opponents. The opponents are the same as in the previous test. Each opponent is tested under a different time control, but the difference between time controls is no more than 5x the smaller one.
I gather all the data at the end and make some assumptions.
If I'm sure that a bug was fixed or I made an optimization, then I can afford to run A vs A' before repeating the above exercise.
Of course, my testing method could be complete bull...t, but so far it has worked.
In addition, I hate testing myself, especially due to the lack of decent hardware.