LOS
Posted: Sun Mar 31, 2013 9:18 am
There seems to be a fair amount of klutzy information about LOS (likelihood of superiority) around. Some of it is good, some of it is good but misleading...
Firstly, LOS is ill-defined. As Lucas Braesch pointed out somewhere (I think), one needs to assume an underlying Elo distribution of the engines that are being observed. [Rémi Coulom also cleared up this point for me a while back in an email concerning BayesElo and its underlying model]. Most likely, I think that "reasonable" assumptions should lead to numbers close to those seen in tables, but I'll continue anyway.
Having passed this hurdle, one can now write down the tautology that
Prob[X is better than Y given Observation O of X vs Y]
is merely the integral from 0 to infinity of
Prob[X-Y is t Elo with D draw rate] times Prob[Observation O occurs given that X-Y is t Elo with D draw rate]
divided by the same t-integral from -infinity to infinity.
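In symbols, this is just Bayes' rule. Writing p(t) for the assumed prior density of the Elo difference t (the symbol p is notation introduced here for illustration) and treating the draw rate D as fixed, the quotient reads:

```latex
\mathrm{LOS} = \Pr[\,t > 0 \mid O\,]
  = \frac{\displaystyle\int_{0}^{\infty} p(t)\,\Pr[\,O \mid t, D\,]\,dt}
         {\displaystyle\int_{-\infty}^{\infty} p(t)\,\Pr[\,O \mid t, D\,]\,dt}
```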
What needs to be stressed is that the second factor can be computed exactly (trinomial distribution), while the first is guesswork. [One could use score% instead of Elo; this is just a reparametrisation of the t-line. Alternatively, one could take the first factor to be a measure against which one integrates].
For instance, in a self-testing framework, testing X against Y where the latter is a slight patch (or numerical tuning), one might reasonably guess that the "typical" Elo distribution of such patches is (say) Gaussian with a standard deviation of 3 Elo, and that the draw rate is 60%. [Again as I think Lucas has pointed out, the main impact of the draw rate concerns the proper accounting of the number of games played in the Observation; also, to be pedantic, the draw rate should additionally depend slightly on the Elo difference].
Worked example: In the above framework, X versus Y over 10000 games is observed to be 2060 wins, 1887 losses, 6053 draws, or 6.01 Elo. What is the LOS?
First I reparametrise the t-line in units of the assumed standard deviation: s=t/3, so that t=3*s.
Then the first factor in the integral, Prob[X is 3*s Elo better than Y], is just the standard normal density M(s)=1/sqrt(2*Pi)*exp(-s^2/2).
A differential of t Elo corresponds to an expected score of E(t)=1-1/(1+10^(t/400)).
We have E(t)=W(t)+D(t)/2, where D(t)=0.6 and W(t)+L(t)=0.4, so that W(t)=E(t)-0.3 and L(t)=0.7-E(t).
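As a small sketch of the two lines above, here are the score curve and the win/loss/draw split in Python. The function names `expected_score` and `wld` are made up for this illustration, and the draw rate is held fixed at 60% as in the example (ignoring its slight dependence on t).

```python
def expected_score(t):
    """Expected score E(t) for an Elo difference of t."""
    return 1.0 - 1.0 / (1.0 + 10.0 ** (t / 400.0))


def wld(t, draw_rate=0.6):
    """Win, loss and draw probabilities at Elo difference t, fixed draw rate."""
    e = expected_score(t)
    d = draw_rate
    w = e - d / 2.0            # W(t) = E(t) - 0.3 when d = 0.6
    l = 1.0 - d / 2.0 - e      # L(t) = 0.7 - E(t) when d = 0.6
    return w, l, d


# Example: the split at +6 Elo
print(wld(6.0))
```

At t=0 this gives W=L=0.2 and D=0.6, and W+L+D=1 for every t.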
One then puts W,L,D into the trinomial distribution, over 10000 games:
W(t)^2060*L(t)^1887*D(t)^6053*10000!/2060!/1887!/6053!
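Evaluated naively in floating point, the factorials overflow and the powers underflow, so in practice one would work with logarithms. A minimal sketch, assuming Python's `math.lgamma` for the log-factorials (the name `log_trinomial` is hypothetical):

```python
import math


def log_trinomial(w, l, d, wins=2060, losses=1887, draws=6053):
    """Natural log of the trinomial probability of the observed result."""
    n = wins + losses + draws
    # log of n!/(wins! losses! draws!), using lgamma(k + 1) = log(k!)
    log_coeff = (math.lgamma(n + 1) - math.lgamma(wins + 1)
                 - math.lgamma(losses + 1) - math.lgamma(draws + 1))
    return (log_coeff + wins * math.log(w)
            + losses * math.log(l) + draws * math.log(d))


# Example: probabilities slightly favouring X versus slightly favouring Y
print(log_trinomial(0.21, 0.19, 0.6), log_trinomial(0.19, 0.21, 0.6))
```

Note that the multinomial coefficient and the D^6053 factor do not depend on t when D is fixed, so they cancel in the final quotient.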
Call this R(t), where t=3*s in the reparametrisation, so that R(s) below means R evaluated at t=3*s.
One can truncate the s-integral to (say) 0 to 5 in the numerator and -5 to 5 in the denominator, as rarer events have little impact.
Integrating R(s)*M(s) numerically and taking the quotient as indicated gives an LOS of 98.67%, if I made no math errors.
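For anyone wanting to check the arithmetic, here is a sketch of the whole calculation in plain Python. The assumptions are those of the worked example (Gaussian prior with a standard deviation of 3 Elo, fixed 60% draw rate, truncation at 5 standard deviations); the choice of Simpson's rule, the step count, and all function names are mine for illustration. Constants that cancel in the quotient (the multinomial coefficient, D^6053, and 1/sqrt(2*Pi)) are dropped.

```python
import math

WINS, LOSSES = 2060, 1887
PRIOR_SD_ELO = 3.0   # assumed prior standard deviation, in Elo
DRAW_RATE = 0.6      # assumed fixed draw rate


def expected_score(t):
    return 1.0 - 1.0 / (1.0 + 10.0 ** (t / 400.0))


def log_integrand(s):
    """log of R(s)*M(s), up to an additive constant that cancels in the quotient."""
    e = expected_score(PRIOR_SD_ELO * s)
    w = e - DRAW_RATE / 2.0
    l = 1.0 - DRAW_RATE / 2.0 - e
    return WINS * math.log(w) + LOSSES * math.log(l) - 0.5 * s * s


def simpson(f, a, b, n=2000):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


# Shift by the largest log value on a grid so that exp() stays in range
SHIFT = max(log_integrand(-5.0 + 0.005 * i) for i in range(2001))


def integrand(s):
    return math.exp(log_integrand(s) - SHIFT)


numerator = simpson(integrand, 0.0, 5.0)
denominator = simpson(integrand, -5.0, 5.0)
los = numerator / denominator
print(f"LOS = {100.0 * los:.2f}%")
```

Under these assumptions the result should land close to the 98.67% quoted above; changing `PRIOR_SD_ELO` shows how strongly the answer depends on the guessed prior.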
My understanding of the quick-and-dirty method is that one takes the 2060-1887=173 excess wins over the 3947 decisive games, and via binomials or erfs one gets something like 99.7%. [The CPW page seems to be wrong in its definition of "x" in the LOS formula -- it gives 173/3947, and I think it should be 173/sqrt(3947)?].
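A minimal sketch of the quick-and-dirty version, assuming it means the usual normal approximation on decisive games only: draws are discarded, and wins minus losses is treated as roughly normal with standard deviation sqrt(wins+losses) when the engines are equal. The name `los_quick` is made up here. With z=173/sqrt(3947), the erf form needs z/sqrt(2) as its argument.

```python
import math


def los_quick(wins, losses):
    """Normal-approximation LOS from decisive games only (draws ignored)."""
    z = (wins - losses) / math.sqrt(wins + losses)
    # Standard normal CDF at z, written with erf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


print(f"{100.0 * los_quick(2060, 1887):.2f}%")
```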
To try to understand the difference between the two calculations: essentially mine makes a "correction" via the assumed underlying Gaussian, namely that some of the more extreme events are more likely to be luck-induced than in the other model. E.g., I try to take into account that the observed 6 Elo might be 2 Elo of reality and 4 Elo of luck, or 1 Elo of reality and 5 Elo of luck, etc., and I think I [secretly] weight these various possibilities differently than the other method does.
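One rough way to see this "reality versus luck" split, which is not the exact calculation above but a Gaussian shortcut added purely for illustration: approximate the likelihood by a normal curve centred on the observed Elo, with a delta-method standard error, and combine it with the assumed Gaussian prior (mean 0, standard deviation 3 Elo). The posterior mean is then the observed Elo shrunk toward zero.

```python
import math

wins, losses, draws = 2060, 1887, 6053
n = wins + losses + draws
prior_sd = 3.0  # Elo; the assumed prior from the worked example

score = (wins + 0.5 * draws) / n
observed_elo = 400.0 * math.log10(score / (1.0 - score))

# Per-game variance of the score (win=1, draw=0.5, loss=0)
var_game = (wins + 0.25 * draws) / n - score ** 2
se_score = math.sqrt(var_game / n)
# Delta method: d(Elo)/d(score) for the logistic Elo curve
se_elo = se_score * 400.0 / (math.log(10.0) * score * (1.0 - score))

# Normal prior times normal likelihood: shrink the observation toward 0
weight = prior_sd ** 2 / (prior_sd ** 2 + se_elo ** 2)
post_mean = weight * observed_elo
post_sd = math.sqrt(weight) * se_elo
los_approx = 0.5 * (1.0 + math.erf(post_mean / (post_sd * math.sqrt(2.0))))

print(f"observed {observed_elo:.2f} Elo, standard error {se_elo:.2f} Elo")
print(f"posterior mean {post_mean:.2f} Elo, approximate LOS {100.0 * los_approx:.2f}%")
```

With these numbers the shrinkage weight is about two-thirds, so on average roughly a third of the observed 6 Elo is attributed to luck. A flat prior (very large `prior_sd`) removes the shrinkage and moves the result toward the quick-and-dirty figure.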