arugulum · on Feb 3, 2024

It is no coincidence that EleutherAI named their pretraining dataset "the Pile".

arugulum · on Feb 2, 2024

The Pythia models have all the training data, code, and configurations available.

arugulum · on Feb 2, 2024

EleutherAI as well.

arugulum · on Nov 29, 2023

This argument feels like it's trying to be an inch too smart.

Consider the following: Amazon isn't really an online retail company; it doesn't really sell goods to the consumer. What it does is use "goods" that it delivers as a loss leader to get people to click on buttons to give Amazon money.

Starbucks is a coffee company that introduces an optional extra step of gift cards for consumer convenience reasons. The fact that there is a comically large amount of money held in the gift card system is just that: a slightly comical fact.

meheleventyone · on Nov 29, 2023

Your attempted example isn’t the same though. It’s much more apparent if you look at the search example. Without search there would be no ads business (to a degree), but search itself only loses money and gets compromised in favor of the ads business.

arugulum · on Oct 31, 2023

> the RoPE embeddings in Code Llama were designed for this.

The RoPE embeddings were not "designed" for that. The original RoPE was not designed with length extrapolation in mind. Subsequent tweaks to extrapolate RoPE (e.g. position interpolation) are post-hoc tweaks (with optional tuning) to an entirely vanilla RoPE implementation.
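To make the "post-hoc tweak" point concrete, here is a minimal sketch of vanilla RoPE with position interpolation bolted on. It is plain Python, not any library's implementation; the function names `rope_angles` and `apply_rope` and the `scale` parameter are assumptions made for illustration. With `scale=1.0` this is ordinary RoPE; position interpolation only multiplies the position index by a factor below 1 before the usual rotation.

```python
import math

def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles for one position (hypothetical helper).

    scale=1.0 is vanilla RoPE. scale<1.0 squeezes positions back into the
    range seen during training, which is the position interpolation idea.
    """
    pos = position * scale
    return [pos * base ** (-2 * i / dim) for i in range(dim // 2)]

def apply_rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by the RoPE angles."""
    angles = rope_angles(position, len(vec), base, scale)
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

# A model trained up to 2048 positions, run at 4096 with scale=0.5:
# position 4096 gets exactly the rotation that position 2048 had in training.
q = [1.0, 0.0, 0.5, -0.5]
interpolated = apply_rope(q, 4096, scale=0.5)
in_range = apply_rope(q, 2048)
```

Nothing about the rotation itself changes, which is why this works as a tweak to an unmodified RoPE implementation rather than as a different embedding.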

arugulum · on Oct 11, 2023

BERT was on arXiv before being peer reviewed. As were T5, BART, LLaMA, OPT and GPT-NeoX-20B. The Pile and FLAN were also on arXiv before being peer reviewed. Of course, the original Transformer paper was also on arXiv before being peer reviewed.

Being on arXiv before being peer reviewed is not the or even a problem.

arugulum · on Aug 16, 2023

I want to jump in and correct your usage of "LLaMA Laws" (even though you are using it informally, I just want to clarify).

There is no "LLaMA scaling law". There is a set of LLaMA training configurations.

Scaling laws describe the relationship between training compute, data, and expected loss (performance). Kaplan et al. estimated one set of laws, and the Chinchilla folks refined that estimate (mainly improving it by adjusting the learning rate schedule).

The LLaMA papers do not posit any new law nor contradict any prior one. They chose a specific training configuration that still abides by the scaling laws but with a different goal in mind.

(Put another way: a scaling law doesn't tell you what configuration to train on. It tells you what to expect given a configuration, but you're free to decide on whatever configuration you want.)
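A minimal sketch of that distinction, assuming the parametric form from the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, with the approximate fitted constants reported there. Treat the numbers as illustrative. The function name `expected_loss` and the two example configurations are assumptions chosen for this example, not anything the papers prescribe.

```python
def expected_loss(n_params, n_tokens,
                  E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted loss for a configuration (N parameters, D tokens).

    Constants are the approximate Chinchilla fit; illustrative only.
    The law maps a configuration to an expected loss. It does not pick
    the configuration for you.
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Two configurations you are free to choose between (hypothetical picks):
small_long = expected_loss(7e9, 1e12)      # smaller model, many tokens per parameter
large_short = expected_loss(70e9, 1.4e12)  # larger model, fewer tokens per parameter

print(f"7B on 1T tokens:    {small_long:.3f}")
print(f"70B on 1.4T tokens: {large_short:.3f}")
```

Both calls use the same law. Preferring the smaller model, for example because inference cost matters more than training cost, is a choice of goal and not a new law.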

npsomaratna · on Aug 17, 2023

Isn't the Chinchilla estimate considered to be wrong now?

https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...

FanaHOVA · on Aug 16, 2023

Yep, +1. That's why I used the quotes. :) Thanks for expanding!

arugulum · on Aug 16, 2023

Yep I understood that you were using it informally, just trying to keep things informative for other folks reading too.

swyx · on Aug 16, 2023

there frankly needs to be a paper calling this out tho, because at this point there are a bunch of industry models following “llama laws” and nobody’s really done the research, it's all monkey see monkey do

arugulum · on Aug 16, 2023

But what would they be calling out?

If industry groups want to run a training run based on the configurations of a well-performing model, I don't see anything wrong with that. Now, if they were to claim that what they are doing is somehow "optimal", then there would be something to criticize.

swyx · on Aug 16, 2023

poor choice of words, i probably mean sketching out the curves/doing ablation studies in a comprehensive way like the chinchilla paper did.

arugulum · on Aug 16, 2023

Makes sense! But expensive...

arugulum · on Aug 16, 2023

If you want a speedrun explanation for how we get to "2": in the limit of model scaling, context size doesn't matter (yes, forget about the quadratic attention), because most of the compute is in the linear layers, which boil down to matrix multiplies. Consider a single matrix of size [T,d] multiplied by a weight of size [d,d]: the compute needed for that matrix multiplication is approximately 2Td^2 (the 2 coming from multiply + add). Swap T out for D, your whole dataset in tokens; d^2 is the number of parameters in a single linear layer, so scale that up to your model's P parameters, and you've got 2PD.

Even shorter: the 2 comes from the multiply-add.
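The count can be checked directly. Below is a minimal sketch in plain Python that runs a naive [T,d] x [d,d] matrix multiply and tallies one multiply plus one add per inner step. The function names `matmul_with_flop_count` and `linear_layer_flops` are assumptions for illustration, and the tally only covers a single pass through the linear layers, as in the derivation above.

```python
def matmul_with_flop_count(x, w):
    """Naive matmul of x [T, d_in] by w [d_in, d_out], counting FLOPs."""
    T, d_in, d_out = len(x), len(w), len(w[0])
    flops = 0
    out = [[0.0] * d_out for _ in range(T)]
    for t in range(T):
        for j in range(d_out):
            acc = 0.0
            for k in range(d_in):
                acc += x[t][k] * w[k][j]  # one multiply, one add
                flops += 2
            out[t][j] = acc
    return out, flops

def linear_layer_flops(n_params, n_tokens):
    """The 2PD approximation: 2 FLOPs per parameter per token."""
    return 2 * n_params * n_tokens

T, d = 3, 4
x = [[1.0] * d for _ in range(T)]
w = [[0.5] * d for _ in range(d)]
_, counted = matmul_with_flop_count(x, w)

# The layer has d*d parameters and sees T tokens, so both counts agree.
assert counted == linear_layer_flops(d * d, T)
```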

arugulum · on Aug 8, 2023

It's actually even less remarkable than that. It was an experiment in having a limited release, to shift the field toward a different release convention.

> Nearly a year ago we wrote in the OpenAI Charter: “we expect that safety and security concerns will reduce our traditional publishing in the future, while increasing the importance of sharing safety, policy, and standards research,” and we see this current work as potentially representing the early beginnings of such concerns, which we expect may grow over time.

> This decision, as well as our discussion of it, is an experiment: while we are not sure that it is the right decision today, we believe that the AI community will eventually need to tackle the issue of publication norms in a thoughtful way in certain research areas.

> We will further publicly discuss this strategy in six months.

https://openai.com/research/better-language-models

arugulum · on Aug 5, 2023

While MoE-LoRAs are exciting in themselves, they are a very different pitch from full-on MoEs. If the idea behind MoEs is that you want completely separate layers to handle different parts of the input/computation, then it is unlikely that you can get away with low-rank tweaks to an existing linear layer.
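A rough sense of the gap, as a minimal sketch: it assumes each MoE-LoRA "expert" is a rank-r adapter (two matrices of shape [d,r] and [r,d]) on a shared [d,d] linear layer, while each full MoE expert owns its own [d,d] layer. The function names and the sizes d=4096, r=16 are hypothetical, picked only to show the scale of the difference.

```python
def full_expert_params(d):
    """A full MoE expert owns an entire [d, d] linear layer."""
    return d * d

def lora_expert_params(d, r):
    """A LoRA expert only adds B [d, r] and A [r, d] on top of shared weights."""
    return 2 * d * r

d, r = 4096, 16  # hypothetical sizes
full = full_expert_params(d)
lora = lora_expert_params(d, r)

print(f"full expert: {full:,} parameters")
print(f"LoRA expert: {lora:,} parameters ({full // lora}x fewer)")
# The LoRA update B @ A has rank at most r, so every LoRA expert stays
# within a rank-r perturbation of the same shared layer.
```

Parameter count is only a proxy for how differently two experts can behave, but it shows why a low-rank adapter and a separate layer are not interchangeable.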