- Minimizing loss could be a useful heuristic for a base model. Here, though, we expect the distribution to shift, because we are only doing RL. Measuring loss against the base model's training data measures divergence from that data: a non-goal, since we expect reasoning after RL training to look quite different from a web scrape.
Let's set that aside. Let's say lower loss = model improved.
- Checking the loss requires the entire dataset used to train the base model, plus a forward pass over it. That's O(N·d), where N is the number of samples and d is the model size. This takes us from "cool demo that RL can be done on the edge, with little benefit" to "we're constantly shipping terabytes of data around among clients" (see the back-of-envelope sketch after this list).
- "Proof of work" as a technical term is different from "proof of work" as a colloquial term: the former is a cryptographic puzzle whose solution is universally and instantly checkable, while the latter just means "I can show I did something," with no guarantee of strictness or uniqueness. Randomly perturbing one parameter could pass as "proof of work" without any of the work we actually wanted getting done (see the toy sketch after this list).
- Early in base model training, shaving 0.01 off the loss is easy; later, it's impossible. In an RL environment, we expect some updates to push base-data loss the wrong way: that is how a policy learns, by moving away from the base distribution toward rewarded behavior. Under the interpretation "loss decreased means model improved means you did work," those updates would register as no work, or negative work. But that does not mean no work was done (the last sketch below makes this concrete).
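To make the O(N·d) point concrete, here's a back-of-envelope sketch. Every constant (corpus size, bytes per token, parameter count) is an assumption picked for illustration, not a measurement:

```python
# Rough cost of verifying "lower loss" against the full pretraining set.
# All constants below are illustrative assumptions, not measurements.

PRETRAIN_TOKENS = 15e12       # assumed pretraining corpus, ~15T tokens
BYTES_PER_TOKEN = 2           # assumed ~2 bytes per token on disk
PARAMS = 7e9                  # assumed 7B-parameter model
FLOPS_PER_TOKEN = 2 * PARAMS  # a forward pass is roughly 2*d FLOPs/token

corpus_tb = PRETRAIN_TOKENS * BYTES_PER_TOKEN / 1e12
verify_flops = PRETRAIN_TOKENS * FLOPS_PER_TOKEN

print(f"data each verifier must hold or ship: ~{corpus_tb:.0f} TB")
print(f"FLOPs per verification pass:          ~{verify_flops:.1e}")
# -> tens of TB of data and ~1e23 FLOPs *per check*: the O(N*d)
#    verification dwarfs whatever the RL step itself cost.
```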
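On the perturbation point, here's a toy sketch of how a naive "loss went down" check gets gamed. The linear model, the spot-check batch, and the perturbation scale are all made up for illustration; the only claim is that loss can drop with none of the intended training happening:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": linear regression weights, plus a small spot-check batch
# a verifier might use instead of the full dataset. Entirely illustrative.
w = rng.normal(size=8)
X = rng.normal(size=(32, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=32)

def loss(w):
    return float(np.mean((X @ w - y) ** 2))

base = loss(w)

# "Cheat": perturb one random parameter at a time, keeping any change
# that happens to lower the spot-check loss. No gradients, no RL,
# no intended work -- yet the loss-delta "proof" passes.
cheat = w.copy()
for _ in range(1000):
    i = rng.integers(len(cheat))
    trial = cheat.copy()
    trial[i] += rng.normal(scale=0.05)
    if loss(trial) < loss(cheat):
        cheat = trial

print(f"loss before: {base:.4f}, after random perturbation: {loss(cheat):.4f}")
# A check of "loss went down => work was done" accepts this submission.
```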
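And on the last point, a minimal REINFORCE sketch (toy softmax policy, made-up reward, arbitrary hyperparameters) where RL genuinely learns its task while loss on base-distribution data goes up, exactly the case a loss-decrease check would misread as no work:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 10  # toy vocabulary size

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# "Base model": uniform over tokens. Its "web scrape" eval set is
# drawn from that same uniform distribution.
eval_tokens = rng.integers(V, size=5000)

def base_data_loss(logits):
    return float(-np.mean(np.log(softmax(logits)[eval_tokens])))

def reward(tok):
    return 1.0 if tok == 3 else 0.0  # toy RL task: learn to emit token 3

logits = np.zeros(V)
for _ in range(500):
    p = softmax(logits)
    tok = rng.choice(V, p=p)
    grad = -p
    grad[tok] += 1.0                 # REINFORCE: gradient of log p(tok)
    logits += 0.1 * reward(tok) * grad

print(f"base-data loss before RL: {base_data_loss(np.zeros(V)):.3f}")
print(f"base-data loss after RL:  {base_data_loss(logits):.3f}")
print(f"P(rewarded token) after:  {softmax(logits)[3]:.3f}")
# Reward went up and base-data loss went up: real learning happened,
# but a "did the loss drop?" check would score this as no work at all.
```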