Since they're part of the pre-processing pipeline, you can't quickly test them for effectiveness; you have to restart a pretraining run to measure the downstream impact.
Separately,
As much as an attention module can do universal nonlinear transformations... I wonder if it makes sense to add specific modules for some math primitives as well. I remember that the executor paper [1] (a slight precursor to the Attention Is All You Need paper) created self-contained modules for operations like less-than, count, and sum, and then explicitly orchestrated them in the decoder.
I'm surprised we haven't seen such approaches produce SOTA results from the math-AI or code-AI research communities.
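For concreteness, here's a rough toy sketch of what I mean by self-contained primitive modules that a decoder could compose explicitly. This is my own illustration, not the paper's actual architecture, and the module names (CountModule, SumModule, LessThanModule) are made up:

```python
import torch
import torch.nn as nn


class CountModule(nn.Module):
    """Soft count: sums per-item attention weights over a set of items."""
    def forward(self, attn):           # attn: (batch, num_items), values in [0, 1]
        return attn.sum(dim=-1, keepdim=True)


class SumModule(nn.Module):
    """Soft sum: attention-weighted sum of per-item scalar values."""
    def forward(self, attn, values):   # values: (batch, num_items)
        return (attn * values).sum(dim=-1, keepdim=True)


class LessThanModule(nn.Module):
    """Differentiable 'a < b' via a sharpened sigmoid on (b - a)."""
    def __init__(self, temperature=0.1):
        super().__init__()
        self.temperature = temperature

    def forward(self, a, b):           # a, b: (batch, 1)
        return torch.sigmoid((b - a) / self.temperature)


if __name__ == "__main__":
    attn = torch.tensor([[1.0, 1.0, 0.0, 1.0]])   # "select" 3 of 4 items
    values = torch.tensor([[2.0, 5.0, 9.0, 1.0]])
    count = CountModule()(attn)                   # ~3
    total = SumModule()(attn, values)             # ~8
    is_less = LessThanModule()(count, total)      # ~1.0, since 3 < 8
    print(count.item(), total.item(), is_less.item())
```

The point being that each primitive is exact (or a smooth relaxation of something exact) and reusable, and the learning problem shifts to deciding which primitive to invoke and with what arguments, rather than re-deriving arithmetic inside attention weights.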
[1] https://arxiv.org/abs/1705.03633