Chris Whittington wrote:
The above just confirms the point ....
you are hill climbing, together with a bunch of relatively similar machines, all competing against the same metric (statistical win rate, a.k.a. Elo)
there's no guarantee, and in fact it is very likely, that the hill being climbed is by no means the highest hill around. And when you get to the top, or near the top, there's no way to jump over onto another, higher hill to repeat the process, because, as you say, "quite often a "fix" that improved the play in a single game we were examining would cause a drastic drop in Elo overall", which has the effect of keeping you on the same hill.
In fact, I think a case can be made that your Elo can continue to rise through this methodology even after you have already reached the top of the (non-optimal) hill and there is nowhere higher to go.
The alternative is witchcraft/voodoo/magic/etc. I personally believe that with a large set of opening positions, and opponents that are stronger than me, so long as I can close the gap, I am getting better overall. This is a far sounder assumption than trying to look at a specific game, isolate a particular move, and adjust either the search or the evaluation to choose a better move. Been there. Done that. Got the T-shirt. It is a flawed methodology.
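For concreteness, the "gap" against a pool of stronger opponents can be read straight off the measured score: under the standard Elo model a score fraction s corresponds to a rating difference of -400*log10(1/s - 1). A minimal Python sketch (the before/after scores are invented, purely for illustration):

```python
import math

def elo_diff_from_score(score: float) -> float:
    """Elo difference implied by a score fraction (0 < score < 1).

    The Elo model says expected score s = 1 / (1 + 10^(-d/400)),
    so d = -400 * log10(1/s - 1).
    """
    return -400.0 * math.log10(1.0 / score - 1.0)

# Hypothetical numbers: scoring 42% before a change and 45% after,
# against the same pool of stronger opponents.
before = elo_diff_from_score(0.42)   # about -56 Elo
after = elo_diff_from_score(0.45)    # about -35 Elo
print(f"gap closed by roughly {after - before:.1f} Elo")
```

If the score against the same opponents goes up, the implied Elo gap shrinks, regardless of which individual moves changed.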
As I mentioned, we tried a few of these early on, just to get a feel for what we _had_ been doing. I would see a game played on ICC where I could analyze it and determine the point where a losing move was made. And with (sometimes) some GM help, we'd look at the good move vs. the bad move, and try to determine whether it was depth or knowledge that caused the error. And in the normal case, after we came up with a fix that made us switch from the bad move to the good one, cluster testing would often show that the "fix" hurt overall. It is _very_ difficult to envision how a change in the evaluation for this position will affect all the other similar but subtly different positions we have to play through.
We often find new ideas by looking at individual games, but this is usually in the form of "we are just not evaluating this very well" or "we have no term that attempts to quantify this particular positional concept". But as we fix those things, we don't just use the game where we made a boo-boo, we play 30,000 games to make sure that it helps in more cases than it hurts, which guarantees upward progress. I had way too many steps backward with Crafty prior to cluster-testing. Others seem to be doing the same, although with different approaches. Rybka apparently plays about 40,000 games at 1-second time controls to tune things. I prefer to occasionally vary the time control to make sure that something that helps at fast games doesn't hurt at slow games.
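The game counts sound extreme, but they follow from simple binomial statistics: the standard error of the measured score shrinks only as 1/sqrt(N), so a change worth a handful of Elo is invisible in a few hundred games. A rough back-of-the-envelope sketch, assuming a per-game score standard deviation of about 0.4 (a typical value once draws count as half a point) and the usual ~7 Elo per 1% of score near equality:

```python
import math

def elo_error_bar(games: int, per_game_sd: float = 0.4) -> float:
    """Approximate 95% error bar, in Elo, on the score of a match
    of `games` independent games.

    Assumes the per-game result (1, 0.5, 0) has standard deviation
    `per_game_sd`, and converts score to Elo at ~700 Elo per unit
    of score (i.e. ~7 Elo per 1%) near a 50% score.
    """
    score_se = per_game_sd / math.sqrt(games)   # standard error of the mean score
    return 1.96 * score_se * 700.0

for n in (1_000, 10_000, 30_000):
    print(f"{n:>6} games: +/- {elo_error_bar(n):.1f} Elo")
```

At 1,000 games the error bar is still around +/- 17 Elo; only somewhere around 30,000 games does it drop into the +/- 3 Elo range where typical evaluation tweaks can be resolved.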
But the point is that this is an objective mechanism, not a subjective one. Lots of "good ideas" have been tossed out because even though they sounded reasonable, we could not find any implementation that didn't hurt overall results. I like the idea of making a change, then running a quick test, and in an hour having a really good idea of whether the idea as implemented worked or not. If not, we try to figure out why, since on quite a few occasions the idea was good but the implementation had a bug. Humans think too highly of their subjective abilities. I've drifted away from that approach after proving over and over that my subjective opinion was quite a long way from the real truth.
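One convenient way to turn a quick test into an objective pass/fail signal is the likelihood of superiority (LOS): given only the decisive games, how likely is it that the new version is genuinely the stronger one? This is the standard formula used in engine testing, not anything taken from Crafty's own test harness; a minimal sketch:

```python
import math

def likelihood_of_superiority(wins: int, losses: int) -> float:
    """Probability that the new version is truly stronger, given the
    win/loss split (draws carry no information about superiority).

    Normal approximation to the binomial sign test:
    LOS = Phi((wins - losses) / sqrt(wins + losses)).
    """
    if wins + losses == 0:
        return 0.5
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical one-hour run: 310 wins, 270 losses (draws ignored).
print(f"LOS = {likelihood_of_superiority(310, 270):.3f}")   # about 0.95
```

A LOS near 0.5 means the test told you nothing yet; a LOS well above 0.95 (or well below 0.05) is the kind of signal worth acting on.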
Is it possible to reach a local maximum? Of course. But we are not doing automated tuning; we are making changes and testing the resulting programs. Which means that, as humans, we can recognize a trend that needs attention and do something about it, even if it requires a complete rewrite of something such as king safety or pawn structure or whatever.
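To make the distinction concrete: pure automated hill-climbing accepts only changes that score better right now, so it can happily park itself on a local peak; a human who decides to rip out and rewrite a whole term (king safety, pawn structure) is effectively jumping to a different hill. A toy illustration of the first behaviour, on an invented one-dimensional "Elo landscape" that has nothing to do with any real evaluation term:

```python
import math
import random

def landscape(x: float) -> float:
    """Invented two-hump curve: a low peak near x=1, a higher one near x=4."""
    return 30.0 * math.exp(-(x - 1.0) ** 2) + 60.0 * math.exp(-(x - 4.0) ** 2)

def greedy_climb(x: float, steps: int = 1000, step_size: float = 0.1) -> float:
    """Accept a small random change only if the measured score improves."""
    for _ in range(steps):
        candidate = x + random.uniform(-step_size, step_size)
        if landscape(candidate) > landscape(x):
            x = candidate
    return x

random.seed(1)
x = greedy_climb(0.0)   # starts on the slope of the lower hill
print(f"converged at x={x:.2f}, value={landscape(x):.1f} (the higher peak is near x=4)")
```

Small greedy steps never cross the valley between the two peaks; recognizing that you are stuck and rewriting the term is the part the automated loop cannot do for you.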