OpenChess

Posted: **Sun Mar 31, 2013 9:18 am**

It seems that there is various klutzy information about LOS around. Some of it is good, some of it good but misleading...

Firstly, LOS is ill-defined

As Lucas Braesch pointed out somewhere (I think), one needs to assume an underlying Elo distribution of the engines that are being observed. [Rémi Coulom also cleared up this point to me awhile back in an email concerning BayesElo and its underlying model]. Most likely, I think that "reasonable" assumptions should lead to numbers close to those seen in tables, but I'll continue anyway.

Having passed this hurdle, one can now write down the tautology that
Prob[X is better than Y given Observation O of X vs Y]
is merely the integral from 0 to infinity of
Prob[X-Y is t Elo with D draw rate] times Prob[Observation O occurs given that X-Y is t Elo with D draw rate]
divided by the same t-integral from -infinity to infinity.

What needs to be stressed is that the second factor can be computed exactly (trinomial distribution), while the first is guesswork. [One could use score% instead of Elo, this is just a reparametrisation of the t-line -- one could alternatively take the first factor to be a measure against which one integrates].

For instance, in a self-testing framework, testing X against Y where the latter is a slight patch (or numerical tuning), one might reasonably guess that the "typical" Elo distribution of such patches is (say) Gaussian with a standard deviation of 3 Elo, and that the draw rate is 60%. [Again as I think Lucas has pointed out, the main impact of the draw rate concerns the proper accounting of the number of games played in the Observation; also, to be pedantic, the draw rate should additionally depend slightly on the Elo difference].

Worked example: In the above framework, X versus Y over 10000 games is observed to be 2060 wins, 1887 losses, 6053 draws, or 6.01 Elo. What is the LOS?

First I reparametrise the t-line in terms of sigma.
Then the first factor in the integral, Prob[X is sigma*3 Elo better than Y], is just M(s)=1/sqrt(2*Pi)*exp(-s^2/2).

The t Elo differential is a score percentage of E(t)=1-1/(1+10^(t/400)).
We have E(t)=W(t)+D(t)/2, where D(t)=0.6 and W(t)+L(t)=0.4, so that W(t)=E(t)-0.3 and L(t)=0.7-E(t).

One then puts W,L,D into the trinomial distribution, over 10000 games:
W(t)^2060*L(t)^1887*D(t)^6053*10000!/2060!/1887!/6053!
Call this R(t) I guess, though one has t=3*sigma in the reparametrisation.

One can truncate the s-integral to (say) 0 to 5, as rarer events have little impact.
Integrating R(s)*M(s) numerically and taking the quotient as indicated gives an LOS of 98.67%, if I made no math errors.

My understanding of the quick-and-dirty method is that one has 2060-1887 over 3947 decisive games, and via binomials or erfs one gets something like 99.7%. [The CPW page seems to be wrong in its definition of "x" in the LOS formula -- it gives 173/3947, and I think it should be 173/sqrt(3947)?].

To try to understand the difference between two calculations, essentially mine makes a "correction" via the assumed underlying Gaussian, that some of the more extreme events are more likely to be luck-induced than in the other model (e.g., I try to take into account that the observed 6 Elo might be 2 Elo of reality and 4 Elo of luck, or 1 Elo of reality and 5 Elo of luck, etc., and I think I [secretly] weight these various possibilities differently than the other method).

Posted: **Sun Mar 31, 2013 1:11 pm**

It's nice to see someone who knows what he's talking about. It's amazing the amount of people who write utter non sense about statistics on these forums, spreading FUD (Fear Uncertainly Doubt).

Typically people use a "LOS formula" that basically uses Gaussian distribution, with the empirical mean and empirical variance. But they FORGET all the CRUCIAL hyphotesis that are behind it:

1/ No early stopping. If N is not a random variable, but a predetermined number, and is "large enough" (large enough actually depends on the quantile we're looking at), then yes, the Central Limit Theorem approximation is fine.

2/ What about the mean and variance ? We don't know either of these, but we are merely observing a REALISATION of the empirical mean and the empirical variance. The empirical mean and variance are random variables too, and they have their variance too...

I'm just talking about the non Bayesian model here. The simple "standard" approach. But already in that simple approach, almost everyone gets it wrong.

Posted: **Sun Mar 31, 2013 2:14 pm**

If I know anything, it is because Rémi explained it to me after I was confused by some results produced by BayesElo.

I might point that when one increases the sigma in my assumed Gaussian distribution (as long as one stays within the 60% draw limit), my "98.67%" LOS in the worked example does indeed seem tend to something around the binomial/erf value of 99.7% For instance, with sigma as 5 Elo (rather than 3 Elo) it is already 99.4%, and at 7 Elo it is 99.6%, quite close to the expected. Maybe I should have made the pedagogical example end up around 95% LOS, so that the effect of this "sigma-window" of the underlying distribution would be more pronounced and/or relevant.

So for the typical CCRL batch of engines (which sprawl all over the Elo and draw ratio axes), the LOS stats that they state might be close to reasonable. But one cannot simply copy over those LOS numbers when dealing with a different testing model (such as trying to measure 3 Elo differences). LOS is not LOS is not LOS, at least until the underlying model is given. Or put another way, the fact that you "know" that your 5 lines of code fiddling can't be worth +12 Elo reality with -6 Elo luck means that this gets essentially weight 0 in the model, whereas for cross-engine testing a real difference of 12 Elo is commonplace and +12 reality with -6 luck is probably about as likely as +0 reality with +6 luck.

Now that I reread the fishcooking thread, I see that Joona Kiiski also pointed out that the model (Elo distribution of patches) needs to be taken into account, and suggested that a mean of -2 with a variance of 3 might be his first rather crude guess.

Posted: **Mon Apr 01, 2013 2:08 am**

Just to show the differences between LOS calculations:

Currently the Stockfish testing has one with 2724 wins, 2554 losses, 10704 draws. Using "standard" techniques on wins-minus-losses, one gets 99.5% LOS. Using my mean 0 deviation 3 Gaussian of patch distribution, I get 98.5% LOS. Or put differently, the Likelihood of Inferiority (LOI) is 3 in 200 instead of 1 in 200. And using Joona's mean -2 deviation 3 Gaussian, the LOS is 96.6%, or almost 7 in 200 chance of being bad.

Note that 10704/16000 is 66.9% draws, and if one changes the draw ratio in the model to 2/3, then LOS under my model increases to 98.9%, and under Joona's model becomes 97.6%.

Posted: **Mon Apr 01, 2013 7:50 am**

Here is a different calculation (this is no longer LOS exactly, but I won't start a new thread yet).

Suppose we have a pool of patches distributed Gaussianly according to 0 mean and 3 Elo deviation. We test these engines for 16K games at 62.5% draw rate and keep results that are observed to be worth at least 3 Elo. What is the acceptance percentage for patches? What is the "expected Elo gain" per patch tested?

I will simplify via approximations, saying that 10K of the games were draws, and the 3 Elo means 3069 wins in the 6000 decisive games. I will also use the normal approximation to the binomial distribution.

Then one integrates M(s)*P(s) where as before M(s)=1/sqrt(2*Pi)*exp(-s^2/2), while P(s) is the probability that at sigma=s we observe at least 3069 wins from 6000 decisive games. I get that we observe a patch to be worth at least 3 Elo about 19% of the time (this could be off by 1 or 2%, due to approximations). The "expected Elo gain" is a bit dodgy, as Elo is not linear, but with results near 50% it is nearly so. For this, we integrate Elo(s)*M(s)*P(s) where Elo(s) is here just 3*s. I get the result as 0.7 Elo per tested patch (or 3.7 per accepted patch).

Switching to the Kiiski patch distribution (-2 mean and 3 deviation), one finds 7.3% of the patches accepted, and a gain of 0.22 Elo per tested patch, or 2.97 Elo per accepted patch. At least if the maths are correct.

One can also vary the 3 Elo accept/reject barrier. For instance, in my patch distribution, when asking for at least 3100 wins rather than 3069, I get 10% patches accepted with 0.47 Elo per tested patch. Indeed, in my patch model, one should simply accept everything that scores at least 3000 wins, and get 1.05 Elo per tested patch. There's also the issue of test length: doing 10 times more work (60000 decisive games, or 160000 total games) only increases this to 1.18 Elo per tested patch, while doing 10 times less work (600 decisive games) only decreases it 0.59 Elo. By now the reader might be able to guess the "point", that for my patch distribution model, I think one optimises by ending the test after one decisive game (here, as elsewhere, I ignore colour), accepting/rejecting the patch based on this result. The idea is that there are so many "good" patches out there, it is more efficient to have a rather silly accept/reject rule, and to test as many patches as possible. OK, maybe this shows that my model of patch distribution is rather unrealistic.

Returning back to the Kiiski patch model, fixing the test length at 16000 games, it seems as though one maximises the Elo gain per patch test at 0.33 Elo by putting accept/reject at 3014 or 3015 wins, corresponding to about 0.65 Elo with the given draw ratio. Reducing the test length by a factor of 5 reduces the Elo gain down to 0.06, indicating that here (contrary to my patch model) one can not reduce the test length too much. It seems that a reduction of 2.5x or 3x, though, will give more Elo per unit time, about 0.53 Elo rather than 0.33 Elo. But as we've seen, everything depends on the patch model (particularly the availability of a large pool of decent patches to test -- if this is not the case, time is better spent ensuring that patches really are beneficial), there is cross-validating at longer time controls, there is the question of early stopping, etc.

Posted: **Mon Apr 01, 2013 10:34 am**

Just to give some idea about how one can have a priori information about the possible Elo gain of a change, take (for instance) an attempt to prune more aggressively. By taking some statistic from the data produced from the games (average search depth, though this might not be the optimal way to measure it), one might find that the old version averages 16.50 ply, and the new one 16.60 ply. This is 0.1 ply more, and if one guesses that reaching a whole ply deeper (on average) is worth something like 70 Elo, then the change here pretty much can't be worth more than 10 Elo (given that the pruning conditions shouldn't "help" the search other than to make it go deeper). If one further considers that this increase of pruning should cause additional mistakes to be made, one might diminish this upper bound even a bit more.

Posted: **Wed Apr 03, 2013 3:43 am**

I have looked at various models of Elo distributions of patches. For instance, one might use the Skew Normal Distribution. Suppose we use this with a sigma of 3 Elo and a skew of -1 (this skew seems most reasonable to me after playing around some). The mean is about -0.56 sigma, or -1.68 Elo. The x-axis in the picture is Elo.

Note that the skew normal distribution still keeps the same amount of mass in any symmetric interval, for instance, approximately 95.45% of the mass still lies between -2 and 2 sigma. This might not be all that realistic.

One issue is that of "really good" patches. Note that approximately 1/40 of the above distribution lies above 1 sigma. That is, about 1/40 of the submitted patches are worth 3 or more Elo in reality. This is a rather slim tail. However, a full 25% of the patches are positive Elo, that is, at least 0 sigma. The large number of positive-but-not-by-much patches leads me to question the value of attempting to isolate "really good" patches, though the analysis below seems to indicate that it could be useful.

Assuming as above that we are going to play a fixed number X of decisive games and accept/reject based upon whether Y games are won (and again colour might be considered for additional rigour), with a draw rate of 62.5% as before, I get that one wants Y to be about 50.9% of X and X to be around 2000. This gains about 0.54 Elo per 10000 decisive games (or per 5 tests). Note that winning 50.9% of the decisive games with a 62.5% draw ratio corresponds to 16K games with 3054 wins, 2946 losses, 10000 draws, which is 2.3 Elo. This then is the "bar" -- a patch has to be observed to gain essentially 2.3 Elo to be accepted.

Another point here is that [for the ranges of interest] the larger one makes Y, the smaller one wants X to be (again, assuming an infinite pool of viable patches lying around). That is, a higher bar should use a shorter test to be maximally efficient. For instance, taking the bar to be quite low, say 0.5 Elo or 50.2% of the decisive games, one gains 0.42 Elo per 10000 decisive games with X around 3500-4000, but lowering the runs to 2000 games (as with a 2.3 Elo bar) reduces this to 0.26 Elo per 10000 decisive games. Similarly, by taking X to be 10000 with Y as 5090, one only gains 0.17 Elo per 10000 decisive games, decreasing the efficiency by a factor of 3. [In effect, one is maximising a 2-variable problem in X and Y/X, and after fixing Y/X one is left with choosing the maximising X].

However, the difference in the respective theoretical Elo rates for the different "bars", 0.54 versus 0.42, seems quite likely to be within the bounds of any experimental model, so the exact methodology used probably doesn't matter that much in the end. One can also of course repeat the calculation for larger/smaller sigma or skew.

One can note that the "0.54 Elo" per 10000 decisive games is actually split up as +0.86 from positive changes and -0.32 from negative ones. The acceptance rate of patches is about 15% [this is the case even though only 5% of patches are in reality worth the "bar" of 2.3 Elo, as various patches that are worse do sufficiently well over the tests of 2000 decisive games -- if one runs the tests for 10K decisive games, the acceptance rate drops to 7%, but as above, the overall Elo efficiency suffers by a factor of 3x, at least under the assumptions made here].

Posted: **Wed Apr 03, 2013 5:21 am**

Here are some tables concerning the expected Elo gain per 10000 decisive games under the above hypotheses. I did the computation for both skew -1 and skew -1.5. One specifies a desired observed win% (or observed Elo) as the accept/reject bar, then maximises Elo gain vis-a-vis number of games to be played. ["dgames" means decisive games].

Skew -1:
50.1% (0.25 Elo): play 3900 dgames, expect 0.39 Elo per 10K dgames
50.2% (0.50 Elo): play 3500 dgames, expect 0.43 Elo per 10K dgames
50.3% (0.75 Elo): play 3200 dgames, expect 0.46 Elo per 10K dgames
50.4% (1.00 Elo): play 2900 dgames, expect 0.48 Elo per 10K dgames
50.5% (1.25 Elo): play 2700 dgames, expect 0.50 Elo per 10K dgames
50.6% (1.50 Elo): play 2400 dgames, expect 0.52 Elo per 10K dgames
50.7% (1.75 Elo): play 2300 dgames, expect 0.53 Elo per 10K dgames
50.8% (2.00 Elo): play 2100 dgames, expect 0.54 Elo per 10K dgames
50.9% (2.25 Elo): play 2000 dgames, expect 0.54 Elo per 10K dgames
51.0% (2.50 Elo): play 1800 dgames, expect 0.54 Elo [max] per 10K dgames
51.1% (2.75 Elo): play 1700 dgames, expect 0.54 Elo per 10K dgames
51.2% (3.00 Elo): play 1600 dgames, expect 0.53 Elo per 10K dgames
51.3% (3.25 Elo): play 1500 dgames, expect 0.53 Elo per 10K dgames
51.4% (3.50 Elo): play 1400 dgames, expect 0.53 Elo per 10K dgames
51.5% (3.75 Elo): play 1300 dgames, expect 0.52 Elo per 10K dgames
51.6% (4.00 Elo): play 1300 dgames, expect 0.51 Elo per 10K dgames
51.7% (4.25 Elo): play 1200 dgames, expect 0.50 Elo per 10K dgames
51.8% (4.50 Elo): play 1100 dgames, expect 0.49 Elo per 10K dgames
51.9% (4.75 Elo): play 1100 dgames, expect 0.48 Elo per 10K dgames
52.0% (5.00 Elo): play 1000 dgames, expect 0.47 Elo per 10K dgames

Skew -1.5:
50.1% (0.25 Elo): play 7800 dgames, expect 0.12 Elo per 10K dgames
50.2% (0.50 Elo): play 6800 dgames, expect 0.13 Elo per 10K dgames
50.3% (0.75 Elo): play 6000 dgames, expect 0.14 Elo per 10K dgames
50.4% (1.00 Elo): play 5300 dgames, expect 0.15 Elo per 10K dgames
50.5% (1.25 Elo): play 4800 dgames, expect 0.16 Elo per 10K dgames
50.6% (1.50 Elo): play 4300 dgames, expect 0.16 Elo per 10K dgames
50.7% (1.75 Elo): play 3900 dgames, expect 0.16 Elo [max] per 10K dgames
50.8% (2.00 Elo): play 3600 dgames, expect 0.16 Elo per 10K dgames
50.9% (2.25 Elo): play 3300 dgames, expect 0.16 Elo per 10K dgames
51.0% (2.50 Elo): play 3000 dgames, expect 0.16 Elo per 10K dgames
51.1% (2.75 Elo): play 2800 dgames, expect 0.16 Elo per 10K dgames
51.2% (3.00 Elo): play 2600 dgames, expect 0.15 Elo per 10K dgames
51.3% (3.25 Elo): play 2400 dgames, expect 0.15 Elo per 10K dgames
51.4% (3.50 Elo): play 2300 dgames, expect 0.14 Elo per 10K dgames
51.5% (3.75 Elo): play 2100 dgames, expect 0.14 Elo per 10K dgames
51.6% (4.00 Elo): play 2000 dgames, expect 0.13 Elo per 10K dgames

One can see that there is a rather large "sweet spot" of parameters, but things do fall off when you make sufficiently inoptimal choices (like playing too many games) The difference between skew -1 and skew -1.5 is evident, both in expected Elo gain, and in number of games one should play.

Additionally (though not shown above), one can vary the assumption that sigma is 3 Elo. For instance, reducing this to 2 Elo, when the skew is -1 one wants to play 4000 decisive games [almost double the previous] with a accept/reject of 50.7% (1.75 Elo), gaining 0.16 Elo per 10K decisive games. Note that 0.16 is much less than 0.54, showing that the positive effects of a large sigma can definitely be seen. When the skew is -1.5 and sigma is 2, detecting positive patches becomes quite difficult, maybe 0.05 Elo per 10K dgames.

From all the above posts, one thing that I have concluded is that figuring out the distribution of your patches is of high significance, and should probably precede the discussion of tuning a testing protocol. One major assumption that the above analyses make is that one has an infinite supply of patches, that is, the pool does not degrade over time as useful items are taken from it. In a real environment, one might also have something like validation testing at different time controls, etc., when (for instance) you might want to know how well results at time control X predict results at time control Y.

Here is the PARI/GP script I used to generate the numbers:

erf(x)=1-erfc(x);
CUMUL(x)=1/2*(1+erf(x/sqrt(2)))
SKEW_DENSITY(a,x)=2*CUMUL(a*x)*1/sqrt(2*Pi)*exp(-x^2/2);
U(s)=3*s; \\ sigma is 3 Elo
E(s)=1-1/(1+10^(U(s)/400)); \\ Elo to results via logistic
D(s)=0.625;
W(s)=E(s)-D(s)/2;
L(s)=1-D(s)/2-E(s);
T(s)=W(s)/(W(s)+L(s));
mu(s,G)=G*T(s);
var(s,G)=mu(s,G)*(1-T(s));
X(G,s,w)=(w-mu(s,G))/sqrt(var(s,G));
normal(G,s,w)=1/2*(1+erf(X(G,s,w)/sqrt(2)));
Observ(G,s,w)=1-normal(G,s,w);

{ExpectedEloRate(G,winp,alpha)=local();
 w=G*winp; 10000/G*intnum(s=-5,5,U(s)*Observ(G,s,w)*SKEW_DENSITY(alpha,s));}

For instance, ExpectedEloRate(2000,0.509,-1) should return 0.539599...

Posted: **Wed Apr 03, 2013 8:51 pm**

Thanks for the pari script. This looks useful

Posted: **Thu Apr 04, 2013 2:30 am**

I currently think one of the main flaws in the above type of analysis (compute expected Elo per test, and size the tests accordingly) is that assumes there are a large number of viable patches. This might be true for a program in its infancy, but after the program matures a bit, it becomes less true. For instance, one can presumably test about 1000 patches a year even with a fairly "long" time control approaching 1 min/game. I don't think one should reasonably expect a top engine to gain something like 200 Elo (0.2 per tested patch) from this though.

When one curtails this large number of viable patches, then it becomes more important not to introduce regressions (recall that "0.54 Elo" gain per 10K dgames was 0.86-0.32 above), which can be done via instrumenting tests with more games. Again one needs to model the situation relevant to a given engine to be able to speculate a mathematical answer. [And as I say, there is some fair leeway in choice of parameters which keep one near the theoretical maximum].

I think the below is reasonable protocol for attempting to optimise a testing system:
1) Determine (or make a guess) on what the Elo distribution of patches should be, and how this might change over time.
2) Think about step 1 very carefully, as it can have a definite impact on the optimal parameters.
3) Use some formula involving LOS(ObservedResults,PatchEloDistribution) to compute optimal parameters for a testing methodology (depending on your setup, this methodology might either be a fixed number of games [distributed workers that don't communicate easily] or some more complicated criterion).
3a) Take colour, draw ratio, and maybe opening book into account in the above.
4) Argue about the relevance of #3, particularly for self-testing versus gauntlets, short time controls versus long,...

Of course, most engines can improve by a decent margin w/o optimising the testing system. Maybe you are only 60-70% as efficient as you could be, but I would still think that overall, being able to design viable patches is probably more important. And some engine authors would prefer to implement things like learning, tablebases, opening book, personalities, etc., in any case.

OpenChess

LOS

LOS

Re: LOS

Re: LOS

Re: LOS

Re: LOS

Re: LOS

Re: LOS

Re: LOS

Re: LOS

Re: LOS