Bradley-Terry and Elo scores are equivalent mathematical models! The fundamental presumption is the same Thurstone model - that an individual's skill in a particular game is a normally distributed random variable around their fundamental skill.
We did experiment with a Bradley-Terry loss function (https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw), but we found it worked even better to calculate Elo scores, do cross-query bias adjustment, and then use an MSE loss to predict the Elo score itself.
->Bradley-Terry and Elo scores are equivalent mathematical models!
No, they are not equivalent mathematical models; they are equivalent only in the calculation of the score function (logistic), given matching scale factors. That is, Bradley-Terry gives P(A beats B) = 1/(1 + e^(x(r_B - r_A))) and the Elo expected score is 1/(1 + 10^((r_B - r_A)/y)), so equivalence requires x = ln(10)/y. More importantly, Elo is an online rating system, meaning it takes the sequence of events into account. From your blog post, I understand that you are not updating the scores after each event. In other words, an Elo rating can be interpreted as an incremental fit of a Bradley-Terry model (with a similar logistic), but it is not the same thing!
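To make the scale-factor point concrete, here is a minimal sketch (the function names and the y = 400 scale are my own illustrative choices, not anything from the post):

```python
import math

def bt_prob(r_a, r_b, x):
    # Bradley-Terry: P(A beats B) = 1 / (1 + e^(x * (r_B - r_A)))
    return 1.0 / (1.0 + math.exp(x * (r_b - r_a)))

def elo_expected(r_a, r_b, y=400.0):
    # Elo expected score: E_A = 1 / (1 + 10^((r_B - r_A) / y))
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / y))

# The two curves coincide exactly when x = ln(10) / y
x = math.log(10) / 400.0
print(bt_prob(1600, 1500, x))    # same value as the Elo expected score
print(elo_expected(1600, 1500))
```

So the agreement is purely in the link function; it says nothing about how the ratings themselves are estimated.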
-> The fundamental presumption is the same Thurstone model
The Thurstone model is similar, and as you said, it assumes a normal (as opposed to logistic) performance distribution, i.e. a probit link function. It predates both models, and due to the computational constraints of the time, you can call Bradley-Terry and Elo computationally convenient approximations of the Thurstone model.
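For reference, the Thurstone Case V probit link and the logistic link can be compared directly; the 1.702 matching constant below is the standard probit-logit approximation, not something from this thread:

```python
import math
from statistics import NormalDist

def thurstone_prob(mu_a, mu_b, sigma=1.0):
    # Thurstone Case V: each performance ~ Normal(mu, sigma^2), independent,
    # so the difference is Normal(mu_a - mu_b, 2 * sigma^2) -> probit link
    return NormalDist().cdf((mu_a - mu_b) / (sigma * math.sqrt(2)))

def logistic_prob(mu_a, mu_b, sigma=1.0):
    # Logistic approximation: Phi(z) ~= 1 / (1 + e^(-1.702 * z))
    z = (mu_a - mu_b) / (sigma * math.sqrt(2))
    return 1.0 / (1.0 + math.exp(-1.702 * z))

# The two links agree to well under 0.01 everywhere
print(thurstone_prob(0.5, 0.0), logistic_prob(0.5, 0.0))
```

That tiny gap is why the logistic family was an acceptable stand-in when probit integrals were expensive to compute.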
-> We did experiment with a Bradley-Terry loss function (https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw)
The math is correct. Thanks for sharing. Indeed, if you do it with incremental updating, you lose differentiability, since each winning probability depends on the previous updates. Call it what you want, but note that this is not truly an Elo rating, which leads to misunderstanding. It is Bradley-Terry, given that you do batch updates, with extra steps taken to connect it to an Elo score, as shown in the link.
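To illustrate the order dependence, here is a minimal online Elo update in its standard K-factor form (K = 32 and the 1500 starting rating are my own illustrative choices): playing the same two results in opposite orders ends with different ratings, something a batch Bradley-Terry fit on the same data cannot produce.

```python
def elo_update(r_a, r_b, score_a, k=32.0, y=400.0):
    # One online Elo step: score_a is 1 if A won, 0 if A lost
    e_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / y))
    return r_a + k * (score_a - e_a), r_b - k * (score_a - e_a)

# Same two results (one win each), applied in two different orders
a1, b1 = elo_update(1500.0, 1500.0, 1.0)   # A wins first...
a1, b1 = elo_update(a1, b1, 0.0)           # ...then B wins
a2, b2 = elo_update(1500.0, 1500.0, 0.0)   # B wins first...
a2, b2 = elo_update(a2, b2, 1.0)           # ...then A wins
print(a1, a2)  # final ratings differ: the sequence matters
```

The second game is played at unequal ratings, so its update is larger than the first one, and whichever result comes last dominates.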
Lastly, the normal and logistic distributions will lead to log(0) in evaluation, which puts an inf in the loss. As I can see from your comment above, you add a uniform(0.02) term as an ad-hoc fix. A more elegant fix is to use a heavy-tailed distribution such as the Cauchy.
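A small sketch of why the tail matters (the 400-point scale and the extreme gap are purely illustrative): under a logistic link, a large enough rating gap makes the probability of the upset underflow to exactly 0.0 in floating point, so its log blows up, while a Cauchy link's polynomial tail keeps it strictly positive.

```python
import math

def logistic_win(r_a, r_b, y=400.0):
    # Logistic (Elo-style) link; the tail decays exponentially
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / y))

def cauchy_win(r_a, r_b, s=400.0):
    # Cauchy CDF link; the tail decays only like 1/x
    return 0.5 + math.atan((r_a - r_b) / s) / math.pi

gap = 200_000.0  # absurd rating gap, just to force underflow
p_upset_logistic = 1.0 - logistic_win(gap, 0.0)  # underflows to 0.0 -> log(0)
p_upset_cauchy = 1.0 - cauchy_win(gap, 0.0)      # still strictly positive
print(p_upset_logistic, p_upset_cauchy)
```

With the Cauchy link the log-likelihood stays finite for any observed upset, with no additive smoothing constant needed.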