In my reading of the paper, I don't feel this is really like biological/spiking networks at all. They keep a running history of inputs and use multi-headed attention to form an internal model of how the past "pre-synaptic" inputs factor into the current "post-synaptic" output. This is just a modified transformer (keep a history of inputs, use attention over them to form an output).
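To make that concrete, here's roughly the shape I have in mind (a PyTorch sketch; the class and variable names are mine, not the paper's):

    import torch
    import torch.nn as nn

    # Sketch of "attention over a running input history" (my naming, not the
    # paper's): each tick, append the new input to a history buffer and attend
    # over it to produce the current "post-synaptic" state.
    class HistoryAttention(nn.Module):
        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.history = []  # grows by one entry per tick

        def forward(self, x):                        # x: (batch, dim)
            self.history.append(x.unsqueeze(1))
            hist = torch.cat(self.history, dim=1)    # (batch, ticks_so_far, dim)
            query = x.unsqueeze(1)                   # current step attends to the past
            out, _ = self.attn(query, hist, hist)
            return out.squeeze(1)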
Then the "synchronization" is just an inner product of all the post-activations (which are stored in a large, ever-growing list and subsampled for performance reasons).
But it's still being optimized by gradient descent, except the time step at which the loss is applied is chosen to be the time step with minimum loss or minimum uncertainty (uncertainty being measured by the entropy of the output).
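As I read it, that loss selection looks something like this (a sketch, not their code; averaging the two selected ticks is my reading of it):

    import torch
    import torch.nn.functional as F

    # Sketch of the loss-step selection as I understand it: compute a per-tick
    # loss and a per-tick entropy, then back-prop through the tick with the
    # lowest loss and the tick with the lowest uncertainty.
    def select_loss(logits_per_tick, target):
        # logits_per_tick: (ticks, batch, classes), target: (batch,)
        losses, entropies = [], []
        for logits in logits_per_tick:
            losses.append(F.cross_entropy(logits, target))
            probs = F.softmax(logits, dim=-1)
            entropies.append(-(probs * probs.clamp_min(1e-9).log()).sum(-1).mean())
        losses = torch.stack(losses)
        entropies = torch.stack(entropies)
        t_loss = losses.argmin()       # most accurate tick
        t_cert = entropies.argmin()    # most certain tick
        return 0.5 * (losses[t_loss] + losses[t_cert])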
I'm not sure where people are getting the idea that this is in any way similar to spiking neuron models with time simulation (time here is just the number of steps the data is cycled through the system, similar to a diffusion model or how an LLM processes tokens recursively).
The "neuron synchronization" is also a bit different from how its meant in biological terms. Its using an inner product of the output terms (producing a square matrix), which is then projected into the output space/dimensions. I suppose this produces "synchronization" in the sense that to produce the right answer, different outputs that are being multiplied together must produce the right value on the right timestep. It feels a bit like introducing sparsity (where the nature of combining many outputs into a larger matrix makes their combination more important than the individual values). The fact that they must correctly combine on each time step is what they are calling "synchronization".
Techniques like this are the basic mechanism underlying attention (produce one or more outputs from multiple subsystems, then combine them with a dot product).
I would say one weakness of the paper is that they primarily compare performance against an LSTM (a simpler recurrent model), rather than against similar attention/diffusion models. I would be curious how well a model that just has N layers of attention in/out would perform on these tasks, using a recursive, time-stepped approach (something like the sketch below). My guess is that performance would be very similar, and the network architecture would also be quite similar (although a true transformer is a bit different from the input attention + U-Net setup they employ).
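For concreteness, the kind of baseline I mean is roughly this (entirely my own sketch, not anything from the paper):

    import torch
    import torch.nn as nn

    # The comparison I'd want to see: a plain stack of attention layers applied
    # recursively over "ticks", with a simple readout head at the end.
    class RecursiveAttentionBaseline(nn.Module):
        def __init__(self, dim, heads=4, layers=4, ticks=8, out_dim=10):
            super().__init__()
            enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc, layers)
            self.ticks = ticks
            self.head = nn.Linear(dim, out_dim)

        def forward(self, x):               # x: (batch, tokens, dim)
            state = x
            for _ in range(self.ticks):     # "time" = cycling data through the stack
                state = self.encoder(state)
            return self.head(state.mean(dim=1))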
Then the "synchronization" is just using an inner product of all the post activations (stored in a large ever-growing list and using subsampling for performance reasons).
But its still being optimized by gradient descent, except the time step at which the loss is applied is chosen to be the time step with minimum loss, or minimum uncertainty (uncertainty being described by the data entropy of the output term).
I'm not sure where people are reading that this is in any way similar to spiking neuron models with time simulation (time is just the number of steps the data is cycled through the system, similar to diffusion model or how LLM processes tokens recursively).
The "neuron synchronization" is also a bit different from how its meant in biological terms. Its using an inner product of the output terms (producing a square matrix), which is then projected into the output space/dimensions. I suppose this produces "synchronization" in the sense that to produce the right answer, different outputs that are being multiplied together must produce the right value on the right timestep. It feels a bit like introducing sparsity (where the nature of combining many outputs into a larger matrix makes their combination more important than the individual values). The fact that they must correctly combine on each time step is what they are calling "synchronization".
Techniques like this are the basic the mechanism underlying attention (produce one or more outputs from multiple subsystems, dot product to combine).