What if all the computation power that's being wasted by crypto (Proof of Work, etc.) could be shifted to powering AI models in a decentralized way, such as this project?
You keep the AI "alive" and in return, you get paid in crypto. What a wonderful time to be alive!
Portion of a discussion I had on this topic with GPT-4:
> Your idea of having all miners work on the same section of the model and compete to find better weights with a lower loss function value is an interesting approach. This would essentially treat the LLM training process like a proof-of-work mining competition. It is true that this approach involves some degree of duplicate work, but the simultaneous, randomized exploration of the weight space could still leverage the distributed computing power of the network.
> Here's a high-level overview of how this approach could work:
> 1. Miners receive a specific section of the LLM's weights and a subset of training data. The training data could be determined based on information from the previous block (e.g., by hashing the previous block).
> 2. Each miner initializes their local copy of the weight section with random perturbations to introduce diversity in the optimization process.
> 3. Miners perform optimization steps (e.g., gradient descent) on their local weight section to minimize the loss function on the given training data.
> 4. Miners search for a solution that satisfies both the proof of training (improvement in the loss function) and the proof of work (hash of the block meeting the difficulty target).
> 5. The first miner to find a solution meeting both criteria broadcasts their updated weight section and the new block to the network.
> 6. Other miners verify the validity of the proposed solution (i.e., checking the improvement in the loss function and the hash meeting the difficulty target) and add the new block to their local copy of the blockchain.
> This approach would turn the mining process into a competitive LLM training process, where miners contribute their computing power towards improving the model. It maintains some of the core properties of proof-of-work mining while directing the computational resources towards a productive goal. However, this approach still needs to address potential issues related to data privacy, intellectual property, and the synchronization of the model's weights across the entire network.
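To make the quoted scheme a bit more concrete, here's a rough sketch of the miner loop it describes. Everything is made up for illustration: a toy linear model and mean-squared error stand in for an LLM weight slice and its loss, and the difficulty target is trivial.

```python
import hashlib
import json
import numpy as np

rng = np.random.default_rng(0)

# Toy "training data"; in the real scheme this would be derived from the
# previous block (e.g. by hashing it to pick a data subset).
X = rng.normal(size=(64, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=64)

def loss(w):
    # Mean squared error stands in for the LLM's loss on the agreed data.
    return float(np.mean((X @ w - y) ** 2))

def meets_difficulty(block_bytes, difficulty=2):
    # Classic proof-of-work check: hash must start with `difficulty` zeros.
    return hashlib.sha256(block_bytes).hexdigest().startswith("0" * difficulty)

baseline_w = np.zeros(8)          # weight slice carried over from the last block
baseline_loss = loss(baseline_w)

w = baseline_w + 0.01 * rng.normal(size=8)   # random perturbation for diversity
nonce = 0
while True:
    # A few local gradient-descent steps on the assigned weight slice.
    for _ in range(10):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= 0.01 * grad

    candidate = json.dumps(
        {"weights": np.round(w, 6).tolist(), "nonce": nonce}, sort_keys=True
    ).encode()

    # A block is valid only if BOTH criteria hold:
    #   proof of training: the loss improved on the agreed data, and
    #   proof of work:     the block hash meets the difficulty target.
    if loss(w) < baseline_loss and meets_difficulty(candidate):
        print("found block at nonce", nonce, "with loss", loss(w))
        break
    nonce += 1
```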
What's the point of this hashing-the-block business if you can already prove your work by presenting a set of weights that reduces the loss function? And even then, you run into the risk of overfitting if you just blindly optimize for loss like that.
I don't believe that is true. Most technologies are incremental improvements on recombinations of existing tools and techniques. It seems likely to me that LLMs' ability to map associations between concepts can result in humans using them to help invent new technology.
I try to base my judgments of what LLMs can and can't do primarily on my study and research in related fields. I haven't been surprised by the capabilities of any LLM yet, including GPT-4.
Is that a serious question? Studying a field for years should make outcomes in that field less surprising, otherwise what have you been doing?
The creators were surprised in the sense of "we got here sooner than expected" but not "we didn't think this would work". Otherwise they wouldn't have been working on it. And there is nothing new in LLMs in years, it's just increasing fidelity by massively increased scale.
To be honest, I've been more surprised by the incompetence of people in evaluating these systems, including journalists, programmers, and others who should be in a position to know better.
> The creators were surprised in the sense of "we got here sooner than expected" but not "we didn't think this would work". Otherwise they wouldn't have been working on it. And there is nothing new in LLMs in years, it's just increasing fidelity by massively increased scale.
This is categorically false. There are papers being published on all the surprising emergent behavior being observed.
I'm paying attention. I think "scale is all you need" is wrong even when it's right. We have a responsibility to not allow the capabilities to outstrip our ability to understand and control. If we don't do our job that will be the real "bitter lesson."
However, ultimately it's a text predictor driven by a PRNG and I stand by my statement; I think the systems are obviously impressive but the unrealistic expectations people have and the anthropomorphization and projection I'm seeing is even more impressive. Let me know when it starts synthesizing new science or math. By then we're in trouble.
That type of work is not useful for securing a decentralized system. To secure a decentralized blockchain, you need the work to be provably wasteful.
That said, there's a new paradigm of Blockchain brewing called 'Proof of Physical Work', which accepts the centralized limitations and leverages the incentive scheme to produce useful outcomes for society. Perhaps the most interesting one right now is HiveMapper.
It won't secure a decentralized blockchain, but it might still be a great value add for society.
> To secure a decentralized blockchain, you need the work to be provably wasteful.
But why does it have to be wasteful?
As I understand it you just need a hard enough computation to prevent the 51% problem.
Obviously, if the goal was to compute something that had real value, all the failed attempts would count as "provably wasteful", so it would seem to fit the criteria.
> As I understand it you just need a hard enough computation to prevent the 51% problem.
That's not the case; in fact, hardness is barely a requirement, provided you can scale up the number of sub-tasks as much as you need.
When the work is provably wasteful, the model simplifies a great deal. Usefulness of the work complicates the dynamics and makes many scenarios plausible that otherwise wouldn't be. It then doesn't follow from the game theory that you'd choose a currency with these extra elements of instability.
I don't think there's enough evidence to make the hard requirement that it HAS to be provably wasteful. It's just that you introduce a bunch of extra, usually intractable, problems when the work is useful.
Proof of stake involves lockup of capital, wasting the time value of money (despite it earning a small yield), and also taking on the risk of being slashed.
The training of LLMs, I believe, has a well-defined loss function with a measurable value on a specific input set. I actually already discussed this problem extensively with GPT-4 itself, so I'll leave you with parts of that discussion:
> Me: Do LLMs have a well defined loss function that they optimize over?
> GPT-4 Assistant: Yes, large language models (LLMs) have a well-defined loss function that they optimize over during the training process. The choice of loss function depends on the specific task the model is designed for and the architecture used. For LLMs like GPT-3, which are based on the transformer architecture and designed for natural language processing tasks, the most commonly used loss function is the cross-entropy loss.
> Cross-entropy loss measures the difference between the model's predicted probability distribution over the possible output tokens (words, subwords, or characters, depending on the tokenization used) and the true probability distribution, which assigns a probability of 1 to the correct output token and 0 to all other tokens. The goal of the training process is to minimize the cross-entropy loss, which essentially means making the model's predicted probabilities as close as possible to the true probabilities.
> During training, the LLM processes input sequences (e.g., sentences or paragraphs) and generates output token probabilities based on its current weights. The cross-entropy loss is computed using these probabilities and the true target tokens, and then the gradients of the loss with respect to the model's weights are calculated using backpropagation. The weights are updated using an optimization algorithm, such as stochastic gradient descent (SGD) or one of its variants like Adam or Adagrad, in order to minimize the loss.
> The process of iteratively updating the model's weights to minimize the loss function is what allows LLMs to learn patterns and structure in the training data, ultimately enabling them to generate coherent text, answer questions, or perform other language-related tasks.
> Me: Is it possible to restrict the training with a rules that says only these weights, say specific slices, are allowed to change, keep all other weights fixed? Is this how LLMs can be trained in a distributed fashion?
> GPT-4 Assistant: Yes, it is possible to restrict the training of an LLM to specific slices or subsets of weights while keeping other weights fixed. This is a common approach used in distributed and parallel training scenarios, where multiple devices or nodes work on different parts of the model simultaneously. This approach is known as model parallelism.
> In model parallelism, the model's weights are divided into smaller subsets or partitions, with each partition being assigned to a different device or node. During training, each device works only on updating the weights in its assigned partition while keeping the weights in the other partitions fixed. The gradients are computed for each partition separately, and the updates are performed locally on each device.
> Once all devices complete their local updates, they communicate their updated weights with one another to synchronize the model's state. This communication can be done through various methods, such as parameter server architecture or all-reduce algorithms. The key to effective model parallelism is balancing the computational workload among the devices while minimizing the communication overhead.
> Adapting this concept for the proposed proof of work and LLM training combination could involve assigning different subsets of weights to miners and having them work on updating these subsets. This would essentially turn the mining process into a form of distributed LLM training, with each miner contributing to a portion of the model's weights. However, coordinating and synchronizing the updates among miners in a decentralized environment, while maintaining the integrity and security of the blockchain, would be a significant challenge that needs to be addressed in the design.
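The two mechanics GPT-4 describes above, next-token cross-entropy loss and updating only a chosen slice of the weights while everything else stays fixed, are easy to demonstrate in a few lines of PyTorch. This is only an illustrative sketch with made-up sizes, not how production LLMs or any actual distributed protocol would do it:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, dim = 100, 16

# Tiny stand-in "language model": token embedding + output projection.
model = nn.Sequential(
    nn.Embedding(vocab, dim),
    nn.Linear(dim, vocab),
)

# Freeze every parameter, then unfreeze only the slice assigned to this node
# (here, arbitrarily, the output projection).
for p in model.parameters():
    p.requires_grad_(False)
for p in model[1].parameters():
    p.requires_grad_(True)

tokens = torch.randint(0, vocab, (4, 9))         # fake batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict the next token

logits = model(inputs)                           # (batch, seq, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab), targets.reshape(-1)
)
loss.backward()

# Only the unfrozen slice accumulates gradients and gets updated.
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)
opt.step()
print("loss:", loss.item(), "| embedding grad is None:", model[0].weight.grad is None)
```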
There is an AI-generated spam joke in there but I can't think of it right now. I'm much too overwhelmed [again] by the clarity of that explanation.
I believe someone posted a paper addressing the riddle at the end: apparently one can also merge weights if work is done on a partition repeatedly/accidentally. The rest of the merging seems to work a bit like BitTorrent(?)
Proof-of-work only works with a particular set of computational problems, i.e. those in NP. I'm not sure if running an LLM fits that bill.
I suppose you could combine proof-of-stake with it in some way (e.g. you commit to an input/output tuple and get your stake slashed if it can be shown to not reproduce on a "canonical" execution), though?
That's not nearly as simple as "normal" PoW, though – you'd need to encode some reputational aspect into the system.
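Something like this, maybe (entirely made-up names, with a trivial deterministic function standing in for the "canonical" execution of a model):

```python
import hashlib

def canonical_execution(x: int) -> int:
    # Stand-in for the agreed deterministic computation (e.g. running the model
    # with fixed weights and no sampling).
    return x * x + 1

def commit(inp: int, out: int) -> str:
    return hashlib.sha256(f"{inp}:{out}".encode()).hexdigest()

stakes = {"node_a": 100, "node_b": 100}

# Each node commits to an (input, output) pair; node_b's output is bogus.
claims = {"node_a": (7, canonical_execution(7)), "node_b": (7, 999)}
commitments = {n: commit(i, o) for n, (i, o) in claims.items()}

# A challenger re-runs the canonical execution and slashes any mismatch.
for node, (inp, _) in claims.items():
    if commitments[node] != commit(inp, canonical_execution(inp)):
        stakes[node] -= 50   # slash part of the stake

print(stakes)   # node_b loses part of its stake
```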
You can definitely do arbitrary work as a sort of proof of work. Not quite the same mathematically, but pragmatically similar. The key is building in some redundancy/error-correction and ensuring that a single node can't by itself define "correctness" of a solution. You do that by duplicating work across nodes, distributing chunks randomly and rejecting/rerunning disagreeing results. It's also pretty easy to spot bad actors trying to cheat on their computational work in this scenario.
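As a rough illustration of that redundancy scheme (hypothetical names, trivial stand-in work): each chunk is replicated across a few randomly chosen nodes, results that disagree with the majority are rejected, and nodes that disagree repeatedly stand out as likely cheaters.

```python
import random
from collections import Counter

random.seed(0)

def do_work(chunk, cheat=False):
    # Stand-in for the real computation; a cheater returns garbage.
    return "garbage" if cheat else f"result-of-{chunk}"

nodes = {f"node{i}": (i == 4) for i in range(5)}   # node4 cheats
suspicion = Counter()

for chunk in range(10):
    assigned = random.sample(list(nodes), 3)        # replicate each chunk 3x
    results = {n: do_work(chunk, cheat=nodes[n]) for n in assigned}
    majority, _ = Counter(results.values()).most_common(1)[0]
    for n, r in results.items():
        if r != majority:
            suspicion[n] += 1   # disagreed with the majority: reject and rerun

print(dict(suspicion))   # disagreements accumulate against the cheating node
```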
I don't think it's that easy at all. The work function must be cheap and canonical to check, and the consensus algorithm has to be rigorous, or else it's too easy to attack the security of the network. DoS, Sybil, 51%, social takeover via hard fork, the list goes on...
It has a well-defined loss function with a numerical value. The improvement of this value can serve as a type of difficulty. Check some other comments I've made on this post for how it might work.
It's an interesting idea for sure, but loss doesn't go down forever. I think this ends with a highly overfitted network that grinds to a halt as the optimization gets stuck in local minima.
Even if you get past that, there's no consensus mechanism or finalization as it stands, and validating solutions is relatively expensive.
We've only just started thinking about this, and I suspect these issues are solvable in a protocol. For instance, there could be a distributed protocol that uses cross-validation to control overfitting.
I'm not sure validation is so expensive if the data is small enough. Actually, maybe that's a way to approach this: two types of block that are paired and share the rewards in some way, one that proposes a better slice of weights and another that proves the weights are better out of sample.
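Here's a very rough sketch of that paired-block idea, under obvious simplifications (numpy least-squares standing in for training a weight slice, a fixed held-out split standing in for whatever validators would actually use): a proposal only gets "accepted" if the improvement also holds out of sample.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]     # held-out split used only by validators

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

old_w = np.zeros(5)                 # weight slice from the previous block

# Proposal block: miner finds weights with lower *training* loss.
new_w = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
assert mse(new_w, X_train, y_train) < mse(old_w, X_train, y_train)

# Validation block: accepted only if the *out-of-sample* loss also improved.
accepted = mse(new_w, X_val, y_val) < mse(old_w, X_val, y_val)
print("proposal accepted:", accepted)
```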
Give it a few weeks, and with GPT-4's help I think we can find some promising approaches.