What if all the computation power that's being wasted by crypto (Proof of Work, etc.) could be shifted to powering AI models in a decentralized way, such as this project?
You keep the AI "alive" and in return, you get paid in crypto. What a wonderful time to be alive!
Portion of a discussion I had on this topic with GPT-4:
> Your idea of having all miners work on the same section of the model and compete to find better weights with a lower loss function value is an interesting approach. This would essentially treat the LLM training process like a proof-of-work mining competition. It is true that this approach involves some degree of duplicate work, but the simultaneous, randomized exploration of the weight space could still leverage the distributed computing power of the network.
> Here's a high-level overview of how this approach could work:
> 1. Miners receive a specific section of the LLM's weights and a subset of training data. The training data could be determined based on information from the previous block (e.g., by hashing the previous block).
> 2. Each miner initializes their local copy of the weight section with random perturbations to introduce diversity in the optimization process.
> 3. Miners perform optimization steps (e.g., gradient descent) on their local weight section to minimize the loss function on the given training data.
> 4. Miners search for a solution that satisfies both the proof of training (improvement in the loss function) and the proof of work (hash of the block meeting the difficulty target).
> 5. The first miner to find a solution meeting both criteria broadcasts their updated weight section and the new block to the network.
> 6. Other miners verify the validity of the proposed solution (i.e., checking the improvement in the loss function and the hash meeting the difficulty target) and add the new block to their local copy of the blockchain.
> This approach would turn the mining process into a competitive LLM training process, where miners contribute their computing power towards improving the model. It maintains some of the core properties of proof-of-work mining while directing the computational resources towards a productive goal. However, this approach still needs to address potential issues related to data privacy, intellectual property, and the synchronization of the model's weights across the entire network.
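To make the quoted scheme a bit more concrete, here's a rough sketch of the miner loop it describes. Everything is made up for illustration: a toy linear model and mean-squared error stand in for an LLM weight slice and its loss, and the difficulty target is trivial.

```python
import hashlib
import json
import numpy as np

rng = np.random.default_rng(0)

# Toy "training data"; in the real scheme this would be derived from the
# previous block (e.g. by hashing it to pick a data subset).
X = rng.normal(size=(64, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=64)

def loss(w):
    # Mean squared error stands in for the LLM's loss on the agreed data.
    return float(np.mean((X @ w - y) ** 2))

def meets_difficulty(block_bytes, difficulty=2):
    # Classic proof-of-work check: hash must start with `difficulty` zeros.
    return hashlib.sha256(block_bytes).hexdigest().startswith("0" * difficulty)

baseline_w = np.zeros(8)          # weight slice carried over from the last block
baseline_loss = loss(baseline_w)

w = baseline_w + 0.01 * rng.normal(size=8)   # random perturbation for diversity
nonce = 0
while True:
    # A few local gradient-descent steps on the assigned weight slice.
    for _ in range(10):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= 0.01 * grad

    candidate = json.dumps(
        {"weights": np.round(w, 6).tolist(), "nonce": nonce}, sort_keys=True
    ).encode()

    # A block is valid only if BOTH criteria hold:
    #   proof of training: the loss improved on the agreed data, and
    #   proof of work:     the block hash meets the difficulty target.
    if loss(w) < baseline_loss and meets_difficulty(candidate):
        print("found block at nonce", nonce, "with loss", loss(w))
        break
    nonce += 1
```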
What's the point of this hashing-the-block business if you can already prove your work by presenting a set of weights that reduces the loss function? And even then, you run into the risk of overfitting if you just blindly optimize for loss like that.
I don't believe that is true. Most technologies are incremental improvements on recombinations of existing tools and techniques. It seems likely to me that LLMs' ability to map associations between concepts can result in humans using them to help invent new technology.
I try to base my judgments of what LLMs can and can't do primarily on my study and research in related fields. I haven't been surprised by the capabilities of any LLM yet, including GPT-4.
Is that a serious question? Studying a field for years should make outcomes in that field less surprising, otherwise what have you been doing?
The creators were surprised in the sense of "we got here sooner than expected" but not "we didn't think this would work". Otherwise they wouldn't have been working on it. And there is nothing new in LLMs in years, it's just increasing fidelity by massively increased scale.
To be honest, I've been more surprised by the incompetence of people in evaluating these systems, including journalists, programmers, and others who should be in a position to know better.
> The creators were surprised in the sense of "we got here sooner than expected" but not "we didn't think this would work". Otherwise they wouldn't have been working on it. And there is nothing new in LLMs in years, it's just increasing fidelity by massively increased scale.
This is categorically false. There are papers being published on all the surprising emergent behavior being observed.
I'm paying attention. I think "scale is all you need" is wrong even when it's right. We have a responsibility to not allow the capabilities to outstrip our ability to understand and control. If we don't do our job that will be the real "bitter lesson."
However, ultimately it's a text predictor driven by a PRNG and I stand by my statement; I think the systems are obviously impressive but the unrealistic expectations people have and the anthropomorphization and projection I'm seeing is even more impressive. Let me know when it starts synthesizing new science or math. By then we're in trouble.
That type of work is not useful for securing a decentralized system. To secure a decentralized blockchain, you need the work to be provably wasteful.
That said, there's a new paradigm of Blockchain brewing called 'Proof of Physical Work', which accepts the centralized limitations and leverages the incentive scheme to produce useful outcomes for society. Perhaps the most interesting one right now is HiveMapper.
It won't secure a decentralized blockchain, but it might still be a great value add for society.
> To secure a decentralized blockchain, you need the work to be provably wasteful.
But why does it have to be wasteful?
As I understand it you just need a hard enough computation to prevent the 51% problem.
Obviously, if the goal was to compute something that had real value, all the failed attempts would count as "provably wasteful", so it would seem to fit the criteria.
> As I understand it you just need a hard enough computation to prevent the 51% problem.
That's not the case; in fact, hardness is barely a requirement, provided you can scale up the number of sub-tasks as much as you need.
When the work is provably wasteful, the model simplifies a great deal. Usefulness of the work complicates the dynamics and makes many scenarios plausible that otherwise wouldn't be. It then doesn't follow from the game theory that you'd choose a currency with these extra elements of instability.
I don't think there's enough evidence to make the hard requirement that it HAS to be provably wasteful. It's just that you introduce a bunch of extra, usually intractable, problems when the work is useful.
Proof of stake involves lockup of capital, wasting the time value of money (despite it earning a small yield), and also taking on the risk of being slashed.
The training of LLMs, I believe, has a well-defined loss function with a measurable value on a specific input set. I actually already discussed this problem extensively with GPT-4 itself, so I'll leave you with parts of that discussion:
> Me: Do LLMs have a well defined loss function that they optimize over?
> GPT-4 Assistant: Yes, large language models (LLMs) have a well-defined loss function that they optimize over during the training process. The choice of loss function depends on the specific task the model is designed for and the architecture used. For LLMs like GPT-3, which are based on the transformer architecture and designed for natural language processing tasks, the most commonly used loss function is the cross-entropy loss.
> Cross-entropy loss measures the difference between the model's predicted probability distribution over the possible output tokens (words, subwords, or characters, depending on the tokenization used) and the true probability distribution, which assigns a probability of 1 to the correct output token and 0 to all other tokens. The goal of the training process is to minimize the cross-entropy loss, which essentially means making the model's predicted probabilities as close as possible to the true probabilities.
> During training, the LLM processes input sequences (e.g., sentences or paragraphs) and generates output token probabilities based on its current weights. The cross-entropy loss is computed using these probabilities and the true target tokens, and then the gradients of the loss with respect to the model's weights are calculated using backpropagation. The weights are updated using an optimization algorithm, such as stochastic gradient descent (SGD) or one of its variants like Adam or Adagrad, in order to minimize the loss.
> The process of iteratively updating the model's weights to minimize the loss function is what allows LLMs to learn patterns and structure in the training data, ultimately enabling them to generate coherent text, answer questions, or perform other language-related tasks.
> Me: Is it possible to restrict the training with a rules that says only these weights, say specific slices, are allowed to change, keep all other weights fixed? Is this how LLMs can be trained in a distributed fashion?
> GPT-4 Assistant: Yes, it is possible to restrict the training of an LLM to specific slices or subsets of weights while keeping other weights fixed. This is a common approach used in distributed and parallel training scenarios, where multiple devices or nodes work on different parts of the model simultaneously. This approach is known as model parallelism.
> In model parallelism, the model's weights are divided into smaller subsets or partitions, with each partition being assigned to a different device or node. During training, each device works only on updating the weights in its assigned partition while keeping the weights in the other partitions fixed. The gradients are computed for each partition separately, and the updates are performed locally on each device.
> Once all devices complete their local updates, they communicate their updated weights with one another to synchronize the model's state. This communication can be done through various methods, such as parameter server architecture or all-reduce algorithms. The key to effective model parallelism is balancing the computational workload among the devices while minimizing the communication overhead.
> Adapting this concept for the proposed proof of work and LLM training combination could involve assigning different subsets of weights to miners and having them work on updating these subsets. This would essentially turn the mining process into a form of distributed LLM training, with each miner contributing to a portion of the model's weights. However, coordinating and synchronizing the updates among miners in a decentralized environment, while maintaining the integrity and security of the blockchain, would be a significant challenge that needs to be addressed in the design.
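The two mechanics GPT-4 describes above, next-token cross-entropy loss and updating only a chosen slice of the weights while everything else stays fixed, are easy to demonstrate in a few lines of PyTorch. This is only an illustrative sketch with made-up sizes, not how production LLMs or any actual distributed protocol would do it:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim = 100, 16

# Tiny stand-in "language model": token embedding + output projection.
model = nn.Sequential(
    nn.Embedding(vocab, dim),
    nn.Linear(dim, vocab),
)

# Freeze every parameter, then unfreeze only the slice assigned to this node
# (here, arbitrarily, the output projection).
for p in model.parameters():
    p.requires_grad_(False)
for p in model[1].parameters():
    p.requires_grad_(True)

tokens = torch.randint(0, vocab, (4, 9))         # fake batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict the next token

logits = model(inputs)                           # (batch, seq, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab), targets.reshape(-1)
)
loss.backward()

# Only the unfrozen slice accumulates gradients and gets updated.
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)
opt.step()
print("loss:", loss.item(), "| embedding grad is None:", model[0].weight.grad is None)
```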
There is an AI-generated spam joke in there but I can't think of it right now. I'm much too overwhelmed [again] by the clarity of that explanation.
I believe someone posted a paper addressing the riddle at the end: apparently one can also merge weights if work is done on a partition repeatedly/accidentally. The rest of the merging seems to work a bit like BitTorrent(?)
Proof-of-work only works with a particular set of computational problems, i.e. those in NP. I'm not sure if running an LLM fits that bill.
I suppose you could combine proof-of-stake with it in some way (e.g. you commit to an input/output tuple and get your stake slashed if it can be shown to not reproduce on a "canonical" execution), though?
That's not nearly as simple as "normal" PoW, though – you'd need to encode some reputational aspect into the system.
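Something like this, maybe (entirely made-up names, with a trivial deterministic function standing in for the "canonical" execution of a model):

```python
import hashlib

def canonical_execution(x: int) -> int:
    # Stand-in for the agreed deterministic computation (e.g. running the model
    # with fixed weights and no sampling).
    return x * x + 1

def commit(inp: int, out: int) -> str:
    return hashlib.sha256(f"{inp}:{out}".encode()).hexdigest()

stakes = {"node_a": 100, "node_b": 100}

# Each node commits to an (input, output) pair; node_b's output is bogus.
claims = {"node_a": (7, canonical_execution(7)), "node_b": (7, 999)}
commitments = {n: commit(i, o) for n, (i, o) in claims.items()}

# A challenger re-runs the canonical execution and slashes any mismatch.
for node, (inp, _) in claims.items():
    if commitments[node] != commit(inp, canonical_execution(inp)):
        stakes[node] -= 50   # slash part of the stake

print(stakes)   # node_b loses part of its stake
```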
You can definitely do arbitrary work as a sort of proof of work. Not quite the same mathematically, but pragmatically similar. The key is building in some redundancy/error-correction and ensuring that a single node can't by itself define "correctness" of a solution. You do that by duplicating work across nodes, distributing chunks randomly and rejecting/rerunning disagreeing results. It's also pretty easy to spot bad actors trying to cheat on their computational work in this scenario.
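As a rough illustration of that redundancy scheme (hypothetical names, trivial stand-in work): each chunk is replicated across a few randomly chosen nodes, results that disagree with the majority are rejected, and nodes that disagree repeatedly stand out as likely cheaters.

```python
import random
from collections import Counter

random.seed(0)

def do_work(chunk, cheat=False):
    # Stand-in for the real computation; a cheater returns garbage.
    return "garbage" if cheat else f"result-of-{chunk}"

nodes = {f"node{i}": (i == 4) for i in range(5)}   # node4 cheats
suspicion = Counter()

for chunk in range(10):
    assigned = random.sample(list(nodes), 3)        # replicate each chunk 3x
    results = {n: do_work(chunk, cheat=nodes[n]) for n in assigned}
    majority, _ = Counter(results.values()).most_common(1)[0]
    for n, r in results.items():
        if r != majority:
            suspicion[n] += 1   # disagreed with the majority: reject and rerun

print(dict(suspicion))   # disagreements accumulate against the cheating node
```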
I don't think it's that easy at all. The work function must be cheap and canonical to check, and the consensus algorithm has to be rigorous, or else it's too easy to attack the security of the network. DoS, Sybil, 51%, social takeover via hard fork, the list goes on...
It has a well-defined loss function with a numerical value. The improvement of this value can serve as a type of difficulty. Check some other comments I've made on this post for how it might work.
It's an interesting idea for sure, but loss doesn't go down forever. I think this ends with a highly overfitted network that grinds to a halt as the optimization gets stuck in local minima.
Even if you get past that, there's no consensus mechanism or finalization as it stands, and validating solutions is relatively expensive.
We've only just started thinking about this, and I suspect these issues are solvable in a protocol. For instance, there could be a distributed protocol that uses cross-validation to control overfitting.
I'm not sure validation is so expensive if the data is small enough. Actually, maybe that's a way to approach this: two types of block that are paired and share the rewards in some way, one that proposes a better slice of weights and another that proves the weights are better out of sample.
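Here's a very rough sketch of that paired-block idea, under obvious simplifications (numpy least-squares standing in for training a weight slice, a fixed held-out split standing in for whatever validators would actually use): a proposal only gets "accepted" if the improvement also holds out of sample.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]     # held-out split used only by validators

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

old_w = np.zeros(5)                 # weight slice from the previous block

# Proposal block: miner finds weights with lower *training* loss.
new_w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
assert mse(new_w, X_train, y_train) < mse(old_w, X_train, y_train)

# Validation block: accepted only if the *out-of-sample* loss also improved.
accepted = mse(new_w, X_val, y_val) < mse(old_w, X_val, y_val)
print("proposal accepted:", accepted)
```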
Give it a few weeks, and with GPT-4's help I think we can find some promising approaches.