There's just no way to train dumb little computers on enough observed data. You have to centralize the AI training at least, and therefore centralize data collection. Most existing services (like Amazon Echo or Siri) also centralize the AI logic, requiring them to be online to function.
This setup of ubiquitous, always-online smart devices reporting to a central collection server is very hazardous. At the very least, we need to require that these devices be disconnectable with a hardware switch (without losing other functionality), and that the training data they send home be pre-cooked as much as possible.
Even that would still mean data collection becomes more limited than it is now (before: everything, as with any "cloud" product; after: only what they need for training).
"A technique called weight quantization, for example, represents each neural network parameter with only a few bits, sometimes a single bit, instead of the standard 32...The models are equally accurate, but the compressed version runs about 20 times faster."
This is pretty great. How can we tell if a particular ML problem is amenable to weight quantization without sacrificing accuracy?
weight quantization basically uses a short list of shortened values as indices into a lookup table that maps back to the desired full-precision values
if you have a 24-bit value, say a 24-bit color, that means you have ~16 million (2^24 == 16,777,216) possible colors
but if you only want to use 200 colors you can, instead of representing each one as the full 24-bit value, use an 8-bit value (2^8 == 256 > 200) and have those 8 bits index into a table that points to the desired full 24-bit value
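to make that lookup-table idea concrete, here's a minimal sketch in Python/NumPy (my own toy illustration, not the article's method): build a 256-entry codebook from a layer's weights, store each weight as an 8-bit index, and reconstruct by lookup

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.normal(size=(256, 256)).astype(np.float32)  # pretend these are a layer's trained 32-bit weights

    # 256-entry codebook: here just evenly spaced quantiles of the weight distribution
    codebook = np.quantile(weights, np.linspace(0.0, 1.0, 256)).astype(np.float32)

    # replace each 32-bit weight with the 8-bit index of its nearest codebook entry
    edges = (codebook[1:] + codebook[:-1]) / 2
    indices = np.digitize(weights, edges).astype(np.uint8)

    # "dequantize" by looking the indices back up in the table
    restored = codebook[indices]

    original_bytes = weights.nbytes                       # 256*256*4
    compressed_bytes = indices.nbytes + codebook.nbytes   # 256*256*1 + 256*4
    print(f"compression: {original_bytes / compressed_bytes:.1f}x")
    print(f"mean abs reconstruction error: {np.abs(weights - restored).mean():.5f}")

that's roughly 4x smaller for 8-bit indices; dropping to fewer bits shrinks both the indices and the table further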
so you have to ask yourself: which parameters of my neural net can be represented as an index? or, which parameters take on fewer distinct values than their full bit-width can represent?
Wikipedia defines ANN parameters as:
An ANN is typically defined by three types of parameters:
The connection pattern between the different layers of neurons
The weights of the connections, which are updated in the learning process.
The activation function that converts a neuron's weighted input to its output activation.
here is a great paper that tries to answer this question in a way that highlights the error resulting from different quantization decisions (i)
By applying weight quantization and measuring the resulting loss in accuracy. Predicting which techniques will work or not in deep learning is still much harder than just stumbling on one that works by trial and error.
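For what it's worth, the measuring step is cheap to sketch. Here's a toy illustration (synthetic data and a linear "model" standing in for a real network, all my own assumptions): uniformly quantize the weights at a few bit widths and watch the accuracy drop.

    import numpy as np

    rng = np.random.default_rng(0)

    # toy stand-in for a trained model: a linear classifier on synthetic data
    X = rng.normal(size=(2000, 64)).astype(np.float32)
    true_w = rng.normal(size=64).astype(np.float32)
    y = (X @ true_w > 0).astype(np.int32)
    w = true_w + rng.normal(scale=0.05, size=64).astype(np.float32)  # "learned" weights

    def accuracy(weights):
        return float(((X @ weights > 0).astype(np.int32) == y).mean())

    def quantize(weights, bits):
        # uniform quantization to 2**bits levels over the weight range
        levels = 2 ** bits
        lo, hi = weights.min(), weights.max()
        step = (hi - lo) / (levels - 1)
        return np.round((weights - lo) / step) * step + lo

    baseline = accuracy(w)
    for bits in (8, 4, 2, 1):
        acc = accuracy(quantize(w, bits))
        print(f"{bits}-bit: accuracy {acc:.3f} (drop {baseline - acc:+.3f})")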
"...realizing the promise of a world populated with tiny intelligent devices at every turn – embedded in our clothes, scattered around our homes and offices"
What an incredible responsibility we have to protect our families from the misuse of this capability.
I wonder if there are certain patterns that take a lot of parameters to express in an NN, that could be represented more efficiently using some other kind of logic, and that could be discovered automatically by some variety of algorithm, such that that subsection of the NN is replaced with the alternate form. A simple case: I'm sure an NN trained to do multiplication is less efficient than just running a multiply op on the hardware. I'm talking about the more complicated scenario where some subset of the NN is performing a replaceable but inefficient function.
Yes, this is a real technique. There are NNs that are mixed with regular programming: as data propagates through the code, a graph is created and gradients flow automatically backwards, training the various neural-net bits. All of this is fully mixable with functions, loops, ifs, and math expressions; the only condition is that any instruction used has to allow gradients to flow, so it needs to be able to assign blame correctly from outputs to inputs.
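A minimal sketch of what that looks like, using PyTorch-style autograd (my own toy example, not from the article): a learnable tensor is used inside ordinary Python loops and ifs, the forward pass records a graph of whatever ops actually ran, and backward() assigns blame from the output back to the weights.

    import torch

    # learnable weights mixed with ordinary Python control flow
    w = torch.randn(4, requires_grad=True)
    opt = torch.optim.SGD([w], lr=0.01)

    def program(xs):
        # a hand-written "program": loop over inputs, branch on their sign,
        # and use the learnable weights inside the arithmetic; every tensor op
        # executed here is recorded on the autograd graph as it runs
        total = torch.tensor(0.0)
        for i, x in enumerate(xs):
            if x.item() >= 0:
                total = total + w[i] * x
            else:
                total = total - w[i] * x   # a different code path, same blame-assignment machinery
        return total

    xs = torch.tensor([1.0, -2.0, 3.0, -4.0])
    target = torch.tensor(10.0)

    for _ in range(200):
        opt.zero_grad()
        loss = (program(xs) - target) ** 2
        loss.backward()    # gradients flow backwards through whichever ops were actually executed
        opt.step()

    print(program(xs).item())   # trains w so the little program outputs ~10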
A second technique is to use deep learning to learn from stack traces. Any old software could be stack-traced by inserting a few prints here and there. Then a NN could learn recursive algorithms just by trying to recreate the whole stack trace, not just the actual outputs. It's a way to distill plain old programming into NNs, by incorporating side information that is cheap to get. This would be useful to quickly teach a NN some algorithm while making it less brittle than symbolic approaches. Imagine how many algorithms could be extracted from conventional software.
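I'm not aware of a canonical implementation, but the data-collection half is easy to sketch (purely illustrative, my own toy code): wrap a recursive function so every call and return is logged, and use the (input, full trace, output) triples as the supervision signal instead of the outputs alone.

    TRACE = []

    def traced(fn):
        # the "insert a few prints here and there" idea: log every call and return
        def wrapper(*args):
            TRACE.append(("call", fn.__name__, args))
            result = fn(*args)
            TRACE.append(("return", fn.__name__, result))
            return result
        return wrapper

    @traced
    def merge_sort(xs):
        if len(xs) <= 1:
            return list(xs)
        mid = len(xs) // 2
        left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])   # recursive calls are traced too
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i])
                i += 1
            else:
                out.append(right[j])
                j += 1
        return out + left[i:] + right[j:]

    def make_example(xs):
        # one supervised example: input, final output, and the whole call/return
        # trace that the NN would be asked to reproduce
        TRACE.clear()
        output = merge_sort(xs)
        return {"input": xs, "output": output, "trace": list(TRACE)}

    example = make_example([3, 1, 4, 1, 5, 9, 2, 6])
    for event in example["trace"][:6]:
        print(event)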
On the first one, I assume we are still talking about hand-generation/coordination of the procedural bits. I was waving my hands at possibly learning the topology of those bits, possibly even by recognizing them as being reproduced [inefficiently] in a trained NN.
I don't think I've ever pondered that second technique before and it's very intriguing. Is there a canonical best-of-class in that category? Offhand, it sounds brutally hard to do.
Also, I think DeepMind might have published something on an NN that learned to write a procedural program implementing a sort algorithm. Is that related?
If you train a large-parameter NN and then somehow prune it down to a lower-parameter NN, will it generally outperform an equally sized low-parameter NN trained from scratch? I'm not talking about reducing bit counts per node here, but total nodes.
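For concreteness, here's the kind of pruning I mean (a toy sketch with random stand-in weights, not a claim about what works): rank a trained layer's hidden units by the norm of their outgoing weights, keep the top ones, and you get a genuinely smaller network that could then be fine-tuned and compared against a same-sized network trained from scratch.

    import numpy as np

    rng = np.random.default_rng(0)

    # toy "trained" two-layer net: 64 inputs -> 256 hidden units -> 10 outputs
    W1 = rng.normal(size=(64, 256)).astype(np.float32)
    W2 = rng.normal(size=(256, 10)).astype(np.float32)

    def prune_hidden_units(W1, W2, keep):
        # rank hidden units by the L2 norm of their outgoing weights and keep the
        # top `keep`, removing whole nodes rather than shrinking the bits per weight
        importance = np.linalg.norm(W2, axis=1)      # one score per hidden unit
        kept = np.sort(np.argsort(importance)[-keep:])
        return W1[:, kept], W2[kept, :]

    W1_small, W2_small = prune_hidden_units(W1, W2, keep=64)
    print(W1.shape, W2.shape, "->", W1_small.shape, W2_small.shape)
    # the pruned 64-unit net would then be fine-tuned and compared against
    # a 64-unit net trained from scratch to answer the question above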
Pack the whole miracle world into a box? Maybe – if you have enough boxes, you can pave some trodden path with boxes.
Or you could stack them and put some marketing salesman on top, praising cowardice as innovation.
It would all be a joke, if something new were at least tried.
But what the article tells us is, basically, that it's time to go back to where we were before the cloud hype and get statistical models on the machines, in the machines, and hooray. Full circle.
Till we are all in boxes.
Each time the cheese is moved, the economic order mutates, creating new winners and losers. That's reason enough to move in a circle, when you are not currently a winner.