Hacker News

It's pretty much the same architecture as GPT-2: a stack of self-attention transformer blocks.
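To make "a stack of self-attention blocks" concrete, here is a minimal sketch of scaled dot-product attention, the core operation in each block. This is a pure-Python, single-head illustration; a real transformer block adds learned query/key/value projections, multiple heads, residual connections, and a feed-forward layer.

```python
import math

def self_attention(q, k, v):
    """Scaled dot-product attention for a single head, in pure Python.

    q, k, v are lists of vectors (lists of floats), one per token.
    Returns one output vector per token: a softmax-weighted mix of
    the value vectors, weighted by query-key similarity.
    """
    d = len(q[0])
    out = []
    for qi in q:
        # similarity of this query against every key, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # numerically stable softmax over the scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # output = weighted sum of the value vectors
        out.append([sum(w * vj[t] for w, vj in zip(weights, v))
                    for t in range(len(v[0]))])
    return out
```

With identical keys the weights are uniform, so each output is just the average of the values.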

The reason these models have gotten better is that we have more GPUs, more data, and far more efficient attention implementations (e.g. FlashAttention, which brings the memory cost of attention down from quadratic to linear in sequence length), so we can train even bigger models. We've also been finetuning models on higher-quality data.

To understand the Orca papers you need to understand how models are trained.

Pretraining: we train a model from scratch on all the text we can get from the internet, with a self-supervised next-token-prediction objective.
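The pretraining objective needs no labels: any stream of tokens yields training pairs by predicting each token from the tokens before it. A toy sliding-window sketch (real training shifts whole batches at once rather than building pairs one by one):

```python
def next_token_examples(token_ids, context_len=4):
    """Turn a raw token stream into (context, next-token) training pairs,
    the self-supervised objective used in pretraining. The model is
    trained to assign high probability to each pair's next token."""
    pairs = []
    for i in range(1, len(token_ids)):
        context = token_ids[max(0, i - context_len):i]
        pairs.append((context, token_ids[i]))
    return pairs
```

For the stream [1, 2, 3, 4] with a context of 2 this yields ([1], 2), ([1, 2], 3), ([2, 3], 4).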

Finetuning: we further train the pretrained model on a specific style of data. For chat models this is called instruction finetuning: the model learns to respond in a specific format and is aligned to be helpful, etc. We do this by training it on a corpus of example conversations in which an assistant answers questions and follows instructions.
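Concretely, each instruction-tuning example is a conversation rendered into one training string with role markers. The exact template varies per model (Llama2-chat uses [INST] tags, Zephyr uses <|system|>/<|user|>/<|assistant|> markers); the layout below is a generic illustration, not any specific model's real template:

```python
def format_chat_example(system, user, assistant):
    """Render one instruction-tuning example with ChatML-style role
    markers. Illustrative layout only; real models each define their
    own template, and training uses that exact format so the model
    learns to produce the assistant turn."""
    return (
        f"<|system|>\n{system}\n"
        f"<|user|>\n{user}\n"
        f"<|assistant|>\n{assistant}"
    )
```

At inference time the same template is applied to the conversation so far, stopping after the final `<|assistant|>` marker, and the model completes the assistant turn.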

Llama2-chat is a finetune of Llama2. Zephyr-7B-β is a finetune of Mistral-7B. Yi-34B-Chat is a finetune of Yi-34B.

We can also further finetune models with RLHF (reinforcement learning from human feedback) and other reinforcement-learning techniques.
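In the standard RLHF recipe, a reward model is first trained on human preference pairs (a chosen vs. a rejected response to the same prompt) with a Bradley-Terry pairwise loss, and that reward model then scores the policy's outputs during RL (e.g. with PPO). A minimal sketch of the pairwise loss:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss used to train RLHF reward models:
    -log sigmoid(r_chosen - r_rejected). Minimizing it pushes the
    reward model to score the human-preferred response higher.
    Sketch only; real training averages this over batches of pairs."""
    diff = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))
```

When the two rewards are equal the loss is log 2, and it shrinks toward zero as the chosen response is scored increasingly higher than the rejected one.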

Most model releases are finetunes of other models: when Meta released the Llama models, it set off a deluge of chat/instruct finetunes from across the community. The Orca papers are essentially finetuning papers; they focus on what kind of data you should feed a model to get the most out of it for instruction following, among other things.
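The Orca recipe ("explanation tuning") can be sketched as a data-construction step: instead of plain (question, short answer) pairs, each question is paired with a system prompt that elicits step-by-step reasoning, and a strong teacher model's detailed explanation is used as the target. The system prompt wording below is illustrative, not taken from the papers:

```python
def make_explanation_example(question, teacher_explanation):
    """Build one Orca-style training example. The assistant target is a
    detailed teacher explanation rather than a bare answer, so the
    student model learns the reasoning style, not just the final answer.
    Field names and prompt wording are hypothetical illustrations."""
    return {
        "system": ("You are a helpful assistant. Think step by step "
                   "and justify your answer."),
        "user": question,
        "assistant": teacher_explanation,
    }
```

Examples built this way are then fed through ordinary instruction finetuning; the only novelty is in what the data looks like.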



