The point seems to be that, at least as currently implemented in Stockfish testing, the SPRT computes the following (given the experimental data, including the draw ratio):
a) Probability that the Elo difference is -1.5
b) Probability that the Elo difference is +4.5
In Larry's case, both (a) and (b) are essentially infinitesimal, so it is then a (much) higher-order effect that has to kick in to terminate the test [a comparison of tails of size (10±epsilon) sigma, or something like that]. What one might want instead is:
c) Probability that the Elo difference is -1.5 or worse
d) Probability that the Elo difference is +4.5 or better
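To make the distinction concrete (notation mine, with f(x | theta) denoting the likelihood of the observed win/draw/loss counts x at Elo difference theta): the current test compares two point likelihoods,

    LLR = log f(x | theta = +4.5) - log f(x | theta = -1.5),

whereas (c) and (d) would compare tail masses under some prior pi(theta) over patch strength,

    LLR' = log integral_{4.5..inf} f(x | theta) pi(theta) dtheta
         - log integral_{-inf..-1.5} f(x | theta) pi(theta) dtheta.

The relevant piece of the current implementation, which computes the point-likelihood version, is: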
# alpha = max type I error (reached at elo = elo0)
# beta = max type II error for elo >= elo1 (reached at elo = elo1)

# Probability laws under H0 and H1
P0 = bayeselo_to_proba(elo0, drawelo)
P1 = bayeselo_to_proba(elo1, drawelo)

# Log-Likelihood Ratio
result['llr'] = (R['wins'] * math.log(P1['win'] / P0['win'])
                 + R['losses'] * math.log(P1['loss'] / P0['loss'])
                 + R['draws'] * math.log(P1['draw'] / P0['draw']))
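For completeness, here is a minimal sketch of the bayeselo_to_proba helper the snippet relies on, matching my reading of the fishtest source (the BayesElo/drawelo parametrization is fishtest's; any deviation from the actual code is mine):

import math

def bayeselo_to_proba(elo, drawelo):
    # Win/draw/loss probabilities in the BayesElo model: drawelo shifts
    # the logistic curve against each side, and the draw probability is
    # whatever mass remains.
    P = {}
    P['win'] = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    P['loss'] = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    P['draw'] = 1.0 - P['win'] - P['loss']
    return P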
One can simulate the "better/worse" conditions by, say, taking the probabilities at 4.5, 4.55, 4.6, 4.65, etc., and combining them appropriately (perhaps integrating over an expected patch distribution, similar to a previous discussion; see the sketch below). But in the Stockfish testing environment, where most patches seem to land reasonably close to the "target" window, the difference between (a)/(b) and (c)/(d) might not be that great.
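A rough sketch of that combination, using a flat prior on a truncated grid as a stand-in for the expected patch distribution (the grid spacing, grid width, and flat weighting are all assumptions for illustration; bayeselo_to_proba is the helper sketched above):

import math

def trinomial_log_likelihood(R, elo, drawelo):
    # Log-likelihood of the observed win/loss/draw counts at a given Elo.
    P = bayeselo_to_proba(elo, drawelo)
    return (R['wins'] * math.log(P['win'])
            + R['losses'] * math.log(P['loss'])
            + R['draws'] * math.log(P['draw']))

def log_mean_exp(xs):
    # Numerically stable log of the average of exp(x) over xs.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))

def composite_llr(R, drawelo, elo0=-1.5, elo1=4.5, step=0.05, width=20.0):
    # "elo1 or better" vs "elo0 or worse": average the likelihood over a
    # grid extending away from each boundary instead of using the two
    # point hypotheses alone.
    n = int(width / step)
    better = [trinomial_log_likelihood(R, elo1 + i * step, drawelo) for i in range(n)]
    worse = [trinomial_log_likelihood(R, elo0 - i * step, drawelo) for i in range(n)]
    return log_mean_exp(better) - log_mean_exp(worse)

With a fine grid and a genuine prior in place of the flat weights, this approximates the ratio of integrated tails written out earlier; for results sitting near the [-1.5, 4.5] window the point and tail versions should track each other closely, which is exactly why the nuance may be small in practice.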