Some experimental "similarity" data
Posted: Wed Dec 29, 2010 9:07 pm
I've decided to stick this in a more technical subforum.
Here are the results of my experiment in "bestmove" matching, à la Don Dailey. I used fixed depth for a variety of reasons, notably that some engines screw up movetime, while others have polling behaviour that can sully any data in fast searches (with SMP another worry). While I much prefer the reproducibility of 1-cpu fixed-depth searches, I don't think this should be seen as a great advance in "scientific" methodology, however.
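For concreteness, a single fixed-depth query is just the usual UCI exchange with the thread count pinned to 1 (the option name varies by engine); the position and the reply shown here are only illustrative:
Code:
uci
setoption name Threads value 1
ucinewgame
position fen r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3
go depth 11
(... engine info lines ...)
bestmove b1c3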
I formed a suite of 8306 positions. I did this by taking a few hundred games, pruning all opening/endgame positions, then pruning those which had a move that was more than +0.25 (in a 1.0s search) above all others, and those for which the eval was more than 2.00 in absolute value. Whether this is a good method of generating positions is an open matter, but hopefully it gives some control over strength issues.
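In code terms, the pruning rule amounts to the following test (my own paraphrase in centipawns, with assumed names, not the actual filtering code from the archive):
Code:
/* Paraphrase of the pruning rule (scores in centipawns).
   best and second are the top two move scores from the 1.0s probe search. */
#include <stdlib.h>

static int keep_position(int best, int second)
{
    if (abs(best) > 200)       /* eval more than 2.00 in size: drop it        */
        return 0;
    if (best - second > 25)    /* one move more than +0.25 above the rest     */
        return 0;
    return 1;                  /* otherwise the position goes into the suite  */
}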
Then I tested 10 engines at various depths. The determination of the proper "depth" is not a science, but I intended for each run to take between 2 and 4 hours (about 1 second per move). I restricted myself to the engine families listed below, as with them I understand how to ensure that the data obtained are what is desired. [With some non-negligible but feasible effort, I could also completely isolate the evaluate() function in each of these if desired, so as to see whether "bestmove" correlation and evaluate() correlation are themselves correlated.]
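(For calibration, at exactly 1 second per move the 8306 positions come to about 2 hours 18 minutes of search; reading the Time column in the table below as hours:minutes, the actual runs of 2:35 to 4:06 work out to very roughly 1.1 to 1.8 seconds per position.)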
Here are the bestmove-matching data (each entry is the number of positions, out of 8306, on which the two engines chose the same move; Time is the total run time in hours:minutes):
Code:
                FR10  FR21  IH47  Ryb1  Ry12  R232  Ryb3  Gla2  SF15  SF19  Time
FR10.at.dp9        0  3920  3290  3529  3600  3581  3381  3876  3611  3528  3:36
FR21.at.dp10    3920     0  3927  4551  4478  4436  4064  4330  4248  4127  4:06
IH47c.at.dp15   3290  3927     0  4333  4423  4641  4921  3885  4370  4411  3:09
R1.at.dp10      3529  4551  4333     0  5523  5259  4552  4264  4408  4283  2:45
R12.at.dp11     3600  4478  4423  5523     0  5464  4638  4272  4468  4379  3:18
R232.at.dp11    3581  4436  4641  5259  5464     0  4840  4206  4454  4378  3:21
R3.at.dp10      3381  4064  4921  4552  4638  4840     0  4057  4434  4380  2:51
GL2.at.dp12     3876  4330  3885  4264  4272  4206  4057     0  4735  4365  2:41
SF151.at.dp13   3611  4248  4370  4408  4468  4454  4434  4735     0  5238  3:57
SF191.at.dp14   3528  4127  4411  4283  4379  4378  4380  4365  5238     0  2:35
The table has the nice feature that it confirms all "known" interactions (and/or preconceived notions). For instance, Rybka 1 through Rybka 2.3.2a have an incredible match, while Rybka 3 still shares a lot with them, and IvanHoe shares much with R3 also. The Fruit 2.1 overlap with the earlier Rybkas also seems apparent (though not quite so pronounced), and this appears to disappear in R3. The 95% confidence interval for any given correlation should be about ±100 (with match counts near half of the 8306 positions, the binomial standard deviation is roughly sqrt(8306·0.25) ≈ 46, so two standard deviations comes to about ±90), so that (for instance) the Glaurung 2 correlation with Fruit 2.1 at 4330 is distinctly less than the Rybka 1.0 Beta correlations with Fruit 2.1 and Rybka 3 of 4551 and 4552 respectively. Again I note that it is not all that clear that strength issues have been adequately addressed. Fruit 2.1 did see a complete (re)write of the Fruit 1.0 eval function, but it still seems that Fruit 1.0 might simply be too weak to correlate well.
All data and programmes are in the attached 7zip archive, which is in a semi-usable form (for instance, I #define things to be 8306 in the C code, to concord with the data size). The DEPTH needs to be given at compile time, while the engine name can be given as a command-line option. As noted, the correlation data I obtained should be entirely reproducible, though it would likely be more useful to run a similar experiment on a different set of positions (possibly pruned as above). I would usually run this via commands like:
Code:
gcc -O3 -DDEPTH=\"11\" -o bestmove bestmove.c
time ./bestmove LINKS/Ryb232 < PRUNE.LIST > R232.at.dp11 &
[...]
./compare FR10.at.dp9 FR21.at.dp10 IH47c.at.dp15 R1.at.dp10 R12.at.dp11 \
R232.at.dp11 R3.at.dp10 GL2.at.dp12 SF151.at.dp13 SF191.at.dp14
where LINKS is a (sub)directory with links to the engines in question.
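The compare step itself is conceptually simple: read the ten bestmove lists side by side and count, for each pair of engines, how many of the 8306 positions received the same move. A minimal sketch of such a counter (my own illustration, not the compare.c shipped in the archive, and assuming one move token per position per file, in the same order for every engine) might look like:
Code:
/* Illustrative pairwise bestmove match counter -- not the compare.c from
   the archive.  Assumes each input file holds one move token per position,
   NPOS lines in all, in the same position order for every engine. */
#include <stdio.h>
#include <string.h>

#define NPOS   8306
#define MAXENG 16

static char moves[MAXENG][NPOS][8];

int main(int argc, char **argv)
{
    int n = argc - 1, i, j, k;

    if (n < 2 || n > MAXENG) {
        fprintf(stderr, "usage: %s FILE1 FILE2 [...]\n", argv[0]);
        return 1;
    }
    for (i = 0; i < n; i++) {                       /* load each bestmove list */
        FILE *f = fopen(argv[i + 1], "r");
        if (!f) { perror(argv[i + 1]); return 1; }
        for (k = 0; k < NPOS; k++)
            if (fscanf(f, "%7s", moves[i][k]) != 1) {
                fprintf(stderr, "%s: expected %d moves\n", argv[i + 1], NPOS);
                return 1;
            }
        fclose(f);
    }
    for (i = 0; i < n; i++) {                       /* pairwise match counts */
        for (j = 0; j < n; j++) {
            int match = 0;
            for (k = 0; k < NPOS; k++)
                if (i != j && strcmp(moves[i][k], moves[j][k]) == 0)
                    match++;
            printf(" %5d", match);
        }
        printf("\n");
    }
    return 0;
}
Zeroing the diagonal (an engine against itself) is just a presentational choice, matching the convention in the table above.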