My understanding is they are all still transformers. The tweaks are more about quantization, changes that let the model generalize over data more efficiently (so fewer parameters are required), and improvements to the training data/process itself.
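To make the quantization point concrete, here's a minimal sketch (my own illustration, not any specific model's scheme) of symmetric int8 post-training quantization: each float32 weight becomes an 8-bit integer plus one shared scale, so storage shrinks roughly 4x while the values stay approximately the same.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Map the largest-magnitude weight to 127; everything else scales linearly.
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)  # 8-bit integer codes
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Approximate reconstruction of the original float weights.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).max()
print(f"max reconstruction error: {error:.4f}")
```

Note this reduces bits per parameter, not the parameter count itself; shrinking the parameter count comes from the other training/architecture improvements.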
Otherwise I'd like to know specifically what's better/improved between the models themselves.