There's just no way to train dumb little computers on enough observed data. You have to centralize the AI training at least, and therefore centralize data collection. Most existing services (like Amazon Echo or Siri) also centralize the AI logic, requiring them to be online to function.
This setup of ubiquitous, always-online smart devices reporting to a central collection server is very hazardous. At the very least, we need to require that these devices be disconnectable with a hardware switch (without losing other functionality), and that the training data they send home be pre-cooked as much as possible.
Even that would still mean data collection becomes more limited than it is now (before: everything, as with any "cloud" product; after: only what they need for training).
"A technique called weight quantization, for example, represents each neural network parameter with only a few bits, sometimes a single bit, instead of the standard 32...The models are equally accurate, but the compressed version runs about 20 times faster."
This is pretty great. How can we tell if a particular ML problem is amenable to weight quantization without sacrificing accuracy?
weight quantization basically uses a short list of shortened values as indices into a lookup table that maps back to the desired full-precision values
if you have a 24-bit value, say a 24-bit color, that means you have ~16 million (2^24 == 16,777,216) possible colors
but if you only want to use 200 colors you can, instead of representing each one as the full 24-bit value, use an 8-bit value (2^8 == 256 > 200) and have those 8 bits index into a table that points to the desired full 24-bit value
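to make that lookup-table idea concrete, here's a minimal sketch in Python/NumPy (my own toy illustration, not the article's method): build a 256-entry codebook from a layer's weights, store each weight as an 8-bit index, and reconstruct by lookup

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.normal(size=(256, 256)).astype(np.float32)  # pretend these are a layer's trained 32-bit weights

    # 256-entry codebook: here just evenly spaced quantiles of the weight distribution
    codebook = np.quantile(weights, np.linspace(0.0, 1.0, 256)).astype(np.float32)

    # replace each 32-bit weight with the 8-bit index of its nearest codebook entry
    edges = (codebook[1:] + codebook[:-1]) / 2
    indices = np.digitize(weights, edges).astype(np.uint8)

    # "dequantize" by looking the indices back up in the table
    restored = codebook[indices]

    original_bytes = weights.nbytes                       # 256*256*4
    compressed_bytes = indices.nbytes + codebook.nbytes   # 256*256*1 + 256*4
    print(f"compression: {original_bytes / compressed_bytes:.1f}x")
    print(f"mean abs reconstruction error: {np.abs(weights - restored).mean():.5f}")

that's roughly 4x smaller for 8-bit indices; dropping to fewer bits shrinks both the indices and the table further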
so you have to ask yourself: which parameters of my neural net can be represented as an index? or, which parameters take on fewer distinct values than their full bit-width can represent?
Wikipedia defines ANN parameters as:
An ANN is typically defined by three types of parameters:
The connection pattern between the different layers of neurons
The weights of the connections, which are updated in the learning process.
The activation function that converts a neuron's weighted input to its output activation.
here is a great paper that tries to answer this question in a way that highlights the error resulting from different quantization decisions (i)
By applying weight quantization and measuring the resulting loss in accuracy. Predicting which techniques will work or not in deep learning is still much harder than just stumbling on one that works by trial and error.
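For what it's worth, the measuring step is cheap to sketch. Here's a toy illustration (synthetic data and a linear "model" standing in for a real network, all my own assumptions): uniformly quantize the weights at a few bit widths and watch the accuracy drop.

    import numpy as np

    rng = np.random.default_rng(0)

    # toy stand-in for a trained model: a linear classifier on synthetic data
    X = rng.normal(size=(2000, 64)).astype(np.float32)
    true_w = rng.normal(size=64).astype(np.float32)
    y = (X @ true_w > 0).astype(np.int32)
    w = true_w + rng.normal(scale=0.05, size=64).astype(np.float32)  # "learned" weights

    def accuracy(weights):
        return float(((X @ weights > 0).astype(np.int32) == y).mean())

    def quantize(weights, bits):
        # uniform quantization to 2**bits levels over the weight range
        levels = 2 ** bits
        lo, hi = weights.min(), weights.max()
        step = (hi - lo) / (levels - 1)
        return np.round((weights - lo) / step) * step + lo

    baseline = accuracy(w)
    for bits in (8, 4, 2, 1):
        acc = accuracy(quantize(w, bits))
        print(f"{bits}-bit: accuracy {acc:.3f} (drop {baseline - acc:+.3f})")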
"...realizing the promise of a world populated with tiny intelligent devices at every turn – embedded in our clothes, scattered around our homes and offices"
What an incredible responsibility we have to protect our families from the misuse of this capability.
I wonder if there are certain patterns that take a lot of parameters to express in an NN, that could be represented more efficiently using some other kind of logic, and that could be discovered automatically by some variety of algorithm, such that that subsection of the NN is replaced with the alternate form. A simple case: I'm sure an NN trained to do multiplication is less efficient than just running a multiply op on the hardware. I'm talking about the more complicated scenario where some subset of the NN is performing a replaceable but inefficient function.
Yes, this is a real technique. There are NNs that are mixed with regular programming: as data propagates through the code, a graph is created and gradients flow automatically backwards, training the various neural-net bits. All of this is fully mixable with functions, loops, ifs, and math expressions; the only condition is that any instruction used has to allow gradients to flow, so it needs to be able to assign blame correctly from outputs to inputs.
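A minimal sketch of what that looks like, using PyTorch-style autograd (my own toy example, not from the article): a learnable tensor is used inside ordinary Python loops and ifs, the forward pass records a graph of whatever ops actually ran, and backward() assigns blame from the output back to the weights.

    import torch

    # learnable weights mixed with ordinary Python control flow
    w = torch.randn(4, requires_grad=True)
    opt = torch.optim.SGD([w], lr=0.01)

    def program(xs):
        # a hand-written "program": loop over inputs, branch on their sign,
        # and use the learnable weights inside the arithmetic; every tensor op
        # executed here is recorded on the autograd graph as it runs
        total = torch.tensor(0.0)
        for i, x in enumerate(xs):
            if x.item() >= 0:
                total = total + w[i] * x
            else:
                total = total - w[i] * x   # a different code path, same blame-assignment machinery
        return total

    xs = torch.tensor([1.0, -2.0, 3.0, -4.0])
    target = torch.tensor(10.0)

    for _ in range(200):
        opt.zero_grad()
        loss = (program(xs) - target) ** 2
        loss.backward()    # gradients flow backwards through whichever ops were actually executed
        opt.step()

    print(program(xs).item())   # trains w so the little program outputs ~10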
A second technique is to use deep learning to learn from stack traces. Any old software could be stack-traced by inserting a few prints here and there. Then a NN could learn recursive algorithms just by trying to recreate the whole stack trace, not just the actual outputs. It's a way to distill plain old programming into NNs, by incorporating side information that is cheap to get. This would be useful to quickly teach a NN some algorithm while making it less brittle than symbolic approaches. Imagine how many algorithms could be extracted from conventional software.
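I'm not aware of a canonical implementation, but the data-collection half is easy to sketch (purely illustrative, my own toy code): wrap a recursive function so every call and return is logged, and use the (input, full trace, output) triples as the supervision signal instead of the outputs alone.

    TRACE = []

    def traced(fn):
        # the "insert a few prints here and there" idea: log every call and return
        def wrapper(*args):
            TRACE.append(("call", fn.__name__, args))
            result = fn(*args)
            TRACE.append(("return", fn.__name__, result))
            return result
        return wrapper

    @traced
    def merge_sort(xs):
        if len(xs) <= 1:
            return list(xs)
        mid = len(xs) // 2
        left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])   # recursive calls are traced too
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                out.append(left[i])
                i += 1
            else:
                out.append(right[j])
                j += 1
        return out + left[i:] + right[j:]

    def make_example(xs):
        # one supervised example: input, final output, and the whole call/return
        # trace that the NN would be asked to reproduce
        TRACE.clear()
        output = merge_sort(xs)
        return {"input": xs, "output": output, "trace": list(TRACE)}

    example = make_example([3, 1, 4, 1, 5, 9, 2, 6])
    for event in example["trace"][:6]:
        print(event)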
On the first one, I assume we are still talking about hand-generation/coordination of the procedural bits. I was waving my hands at possibly learning the topology of those bits, possibly even by recognizing them as being reproduced [inefficiently] in a trained NN.
I don't think I've ever pondered that second technique before and it's very intriguing. Is there a canonical best-of-class in that category? Offhand, it sounds brutally hard to do.
Also, I think DeepMind might have published something on an NN that learned to write a procedural program implementing a sort algorithm. Is that related?
If you train a large-parameter NN and then somehow prune it down to a lower-parameter NN, will it generally outperform an equally sized low-parameter NN trained from scratch? I'm not talking about reducing bit counts per node here, but total nodes.
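For concreteness, here's the kind of pruning I mean (a toy sketch with random stand-in weights, not a claim about what works): rank a trained layer's hidden units by the norm of their outgoing weights, keep the top ones, and you get a genuinely smaller network that could then be fine-tuned and compared against a same-sized network trained from scratch.

    import numpy as np

    rng = np.random.default_rng(0)

    # toy "trained" two-layer net: 64 inputs -> 256 hidden units -> 10 outputs
    W1 = rng.normal(size=(64, 256)).astype(np.float32)
    W2 = rng.normal(size=(256, 10)).astype(np.float32)

    def prune_hidden_units(W1, W2, keep):
        # rank hidden units by the L2 norm of their outgoing weights and keep the
        # top `keep`, removing whole nodes rather than shrinking the bits per weight
        importance = np.linalg.norm(W2, axis=1)      # one score per hidden unit
        kept = np.sort(np.argsort(importance)[-keep:])
        return W1[:, kept], W2[kept, :]

    W1_small, W2_small = prune_hidden_units(W1, W2, keep=64)
    print(W1.shape, W2.shape, "->", W1_small.shape, W2_small.shape)
    # the pruned 64-unit net would then be fine-tuned and compared against
    # a 64-unit net trained from scratch to answer the question above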
Pack the whole miracle world into a box? Maybe – if you have enough boxes, you can pave some trodden path with boxes.
Or you could stack them and put some marketing salesman on top, praising cowardice as innovation.
It would all be a joke, if something new were at least tried.
But what the article tells us is, basically, that it's time to go back to where we were before the cloud hype and get statistical models on the machines, in the machines, and hooray. Full circle.
Till we are all in boxes.
Each time the cheese is moved, the economic order mutates, creating new winners and losers. That's reason enough to move in a circle, when you are not currently a winner.